Improving reliability and absolute quantification of human brain microarray data by filtering and scaling probes using RNA-Seq
© Miller et al.; licensee BioMed Central Ltd. 2014
Received: 26 November 2013
Accepted: 30 January 2014
Published: 24 February 2014
High-throughput sequencing is gradually replacing microarrays as the preferred method for studying mRNA expression levels, providing nucleotide resolution and accurately measuring absolute expression levels of almost any transcript, known or novel. However, existing microarray data from clinical, pharmaceutical, and academic settings represent valuable and often underappreciated resources, and methods for assessing and improving the quality of these data are lacking.
To quantitatively assess the quality of microarray probes, we directly compare RNA-Seq to Agilent microarrays by processing 231 unique samples from the Allen Human Brain Atlas using RNA-Seq. Both techniques provide highly consistent, highly reproducible gene expression measurements in adult human brain, with RNA-Seq slightly outperforming microarray results overall. We show that RNA-Seq can be used as ground truth to assess the reliability of most microarray probes, remove probes with off-target effects, and scale probe intensities to match the expression levels identified by RNA-Seq. These sequencing scaled microarray intensities (SSMIs) provide more reliable, quantitative estimates of absolute expression levels for many genes when compared with unscaled intensities. Finally, we validate this result in two human cell lines, showing that linear scaling factors can be applied across experiments using the same microarray platform.
Microarrays provide consistent, reproducible gene expression measurements, which are improved using RNA-Seq as ground truth. We expect that our strategy could be used to improve probe quality for many data sets from major existing repositories.
KeywordsAllen Brain Atlas Microarray RNA-Seq High-throughput sequencing Transcriptome profiling Reliability Gene expression Brain
RNA-Seq and related sequencing-based technologies are gradually emerging as the preferred method for genome-wide transcriptional analyses, as they provide several potential advantages over hybridization-based microarray technologies [1–5]. Fragment counts from RNA-Seq more reliably track absolute gene expression levels (as measured by quantitative PCR) than the fluorescence- or intensity-based measures obtained using DNA microarrays [2, 3]. Microarray intensities can also be prone to background noise and hybridization saturation, leading to a lower dynamic range than RNA-Seq [1, 2, 4]. Furthermore, as RNA-Seq does not require a priori probe selection, it allows unbiased analysis of the entire transcriptome, including measurements of gene isoforms, noncoding RNAs, novel transcripts , and base-level transcriptional changes. But RNA sequencing technologies do not always represent the most appropriate strategy for large scale transcriptomics. In particular, comparison between new and historical data sets is often desired, and direct comparisons across platforms can be problematic [6, 7]. Currently, data from thousands of studies on all of the major microarray platforms are publicly available in databases such as ArrayExpress  and Gene Expression Omnibus (GEO) . These data have stimulated important advances in many biological areas over the past two decades, including classification of cancer subtypes ; identification of gene expression changes in many diseases; drug discovery [11, 12]; and novel insights into the evolution, development, structure, and dysfunction of the human brain [13–16]. Moreover, microarrays and related technologies are still used in the clinic to measure biomarkers for tumor classification, patient diagnosis, patient prognosis, and predicted response to treatment [17–20]. While there are several options, both commercial (i.e, GeneSpring) and open source (i.e., the "affy" and "limma" libraries in R), for microarray analysis and data quality assessment, to the best of our knowledge none take advantage of the improved absolute gene expression measurements from sequencing technologies.
Here we present the largest comparison between microarray and RNA-Seq to date, using samples from the Allen Human Brain Atlas [14, 21], a publicly available gene expression atlas of the human brain with microarray-based genome-wide transcriptional profiling of specific brain regions spanning all major anatomical structures of the adult brain. RNA aliquots from 231 unique samples across two adult human brains previously analyzed using Agilent microarrays were reprocessed using the Illumina Hiseq RNA-Seq technology, sequenced to a depth of 30 million reads. We find that both methods produce highly reproducible gene expression measurements. RNA-Seq performed slightly better in terms of reproducibility of measurements and detection of differential expression between regions as described previously [2, 3]. However, by treating the RNA-Seq as ground truth, we were able to improve microarray results. First, taking advantage of the high variability of gene expression levels across the adult human brain, we were able to identify the most reliable microarray probe for each gene and remove poorly behaving probes. Moreover, intensities for over 80% of probes could be scaled to provide highly reliable, quantitative estimates of absolute gene expression that should be transferable to any experiment using the same microarray. Finally, we propose an extension to our experimental setup which allows it to be applied to a greater number of probes, and across several microarray platforms. In summary, we find that microarray data can be improved by filtering and scaling probes to RNA-Seq expression values using a relatively small number of samples, and that both methods provide reproducible gene-level expression information that can lead to valuable biological insights.
RNA-Seq data preparation using RSEM alignment followed by TbT and ComBat normalization
Several methods for sequence alignment and gene expression quantification of RNA-Seq data have been developed (for review see ) including the Tuxedo Suite , RSeqTools , and RSEM . These methods each aim to summarize expression levels based on the number of reads that align to each gene, but differ in their treatment of splice junctions and ambiguous sequence alignments. For sequence alignment we used RSEM, which aligns reads to known isoforms and then calculates gene expression as the sum of isoform expression for a given gene, assigning ambiguous reads to multiple isoforms using a maximum likelihood statistical model . The resulting gene expression values are presented as transcripts per million (TPM) after scaling for gene length and for the total number of reads. Our analyses can be reproduced starting from these TPM values using Additional file 1 and data available from the Allen Brain Atlas data portal (http://www.brain-map.org).
Proper normalization of microarray data can remove non-biological differences between samples due to batch effects and differences between arrays. These RNA-Seq data showed minimal batch effects (Additional file 2; supplementary figure legends in Additional file 3), but could potentially still be improved with respect to variability after scaling to the total transcript count. One strategy for doing this is TbT normalization, which scales each sample based on the total number of reads found in genes that are not differentially expressed . This normalization strategy resulted in a slight improvement in data quality when considering cortical vs. non-cortical regions as the two sample groups. Specifically, we see a three-fold decrease in between-sample gene expression variability (based on standard deviation; Additional file 4) and improved between-brain reproducibility, as measured by between-brain correlation of differential expression across brain regions (detailed in next section; increase from R = 0.89 to R = 0.90; Additional file 4). After TbT normalization, we identified a systematic bias between samples from the two brains, in that many genes show a consistent change in expression between brains across many regions assayed. For example, samples from the two brains cluster distinctly for many brain regions (Additional file 2), and furthermore region and brain of origin make up most of the variance between samples (Additional file 4). Whether these between-brain differences are due to real biological differences in brain (i.e., due to age) or technical issues (i.e., due to RNA quality), these systematic differences detract from our ability to compare expression levels between brains, which is one of our primary strategies for assessing biological reproducibility. Therefore, to standardize gene expression data between brains we used ComBat , which is an empirical Bayes framework that was designed to remove batch effects from microarray data. In addition to removing the systematic bias between samples from the two brains (Additional file 4), we find that ComBat also improved between-brain reproducibility in our RNA-Seq data (increase from R = 0.90 to R = 0.92; Additional file 4), which justifies our use of this method in this context. Finally, we note that we used a comparable strategy to further normalize the subset of microarray data from the Allen Human Brain Atlas used in this study (Additional file 3), leading to comparable improvements in data quality (Additional file 4).
RNA-Seq only slightly outperforms microarray based on global reproducibility measures
To compare the dynamic range of both methods, we next plotted a histogram of gene expression levels across all samples in our study (Figure 2C). We note that this result will be highly dependent on sequencing depth, and that our results are based on approximately 30 million reads per sample. RNA-Seq shows higher sensitivity in quantifying genes with very low expression, as shown by the extended leftward tail in the bottom relative to the top plot. Consistent with this finding, more genes were identified as present by RNA-Seq compared to microarray, regardless of the number of samples assessed (Figure 2D). For example, approximately 80% of genes were found by microarray to be expressed in at least half of the samples, compared with approximately 90% by RNA-Seq, with the difference mostly in genes with low expression. However, very few genes identified as absent in RNA-Seq (i.e., no transcript fragments) were called present by microarray, suggesting a relatively low false positive rate in the Agilent present/absent call.
Most genes show expression patterns between brain regions that are highly consistent among individuals . To assess between-brain consistency in these data, we selected 100,000 random genes and 100,000 pairs of randomly selected non-neocortical brain regions in both brains, identified the log2 fold change of each gene between the corresponding pairs of regions, and then plotted these values between brains (Figure 2E-F). Using this strategy, we found highly correlated expression patterns between brains for both microarray (R = 0.84) as well as RNA-Seq (R = 0.92). We next directly compared the magnitude of differential gene expression between methods using the same strategy (Figure 2G). As with absolute expression levels, differential expression fold changes between methods are highly correlated (R = 0.78), at a level consistent with previous studies (Spearman = 0.73 between liver and kidney, for example ). These correlations are not as significant as between brains (Figure 2E), despite the fact that RNA from the same RNA samples were used in both methods, supporting results from previous studies that comparisons across platforms can be problematic [6, 7]. Finally, to identify specific genes showing significantly different patterns of expression between non-neocortical brain regions, we performed ANOVA on all samples from these areas. At several p-value thresholds we identified a highly overlapping set of differentially expressed genes, with more genes reaching significance using RNA-Seq than microarray (Figure 2H). Collectively, these global metrics show that, although Illumina sequencing technologies slightly outperform Agilent microarrays by all of these metrics, both methods can consistently and reproducibly evaluate expression levels in the adult human brain for a large percentage of genes.
Reproducibility dependent on gene expression level and gene size
RNA-Seq can filter Agilent microarray probes
In addition to choosing the best probe for each gene, this strategy can be used to assign each probe with a quality score (or pass/fail call) based on reproducibility, which could, for example, help eliminate genes from the analysis in which all probes show potential off-target effects or non-specific binding. In this case we score each probe based on the correlation, defining all probes with significant positive correlation as passing (Figure 4A). After correcting for multiple comparisons (q < 0.1), 82% of genes have at least one passing probe on the array, a number that decreases only to 68% if we consider as few as 16 carefully selected samples in the analysis (Additional file 7). After omitting the set of best probes that failed quality control, the remaining genes show marked improvements in between-method and between-brain reproducibility (Figure 4B; green bars; R = 0.87 vs. 0.85 and R = 0.88 vs. 0.86, respectively). Given our previous result showing that genes with low expression tended to show poor biological reproducibility (Figure 3B), we next compared the expression levels of our best probes that passed compared with those that failed (Figure 4C). Around half the probes with log2(intensity) < 3 passed, compared with more than 90% of probes with log2 (intensity) > 3, suggesting that some probes likely fail because the probe itself is bad, whereas other probes may be properly targeting their corresponding gene, but that gene is not expressed in the brain and therefore the between-method reproducibility cannot be accurately assessed. Strategies for recovering this second class of failed probes will be discussed later.
RNA-Seq can be used to improve microarray quality by scaling probes
To test whether SSMIs provide more biologically reproducible results than corresponding intensity scores, we repeated all of our quantitative assessments (see Figures 2 and 3) using SSMIs for the set of optimal probes chosen by RNA-Seq (see Figure 4). As hypothesized, absolute expression levels show a dramatic improvement in reproducibility between RNA-Seq and microarray, with Pearson correlations increasing in many cases to R > 0.95 (Figure 5B). For example, while many microarray probe intensities overestimate gene expression levels by several orders of magnitude, SSMIs for nearly all probes much more closely match TPM values determined by RNA-Seq (Figure 5C). It is important to emphasize that we see these improvements in the test set (brain 1) using scaling parameters calculated using an independent training set (brain 2). Along the same lines, between-method measures of consistency based on differential expression show similar improvements, in both the training set (brain 2; Figure 5D) and test set (brain 1). Furthermore, differential expression fold change correlations between brains based on SSMIs (R = 0.90) are nearly as high as those based on RNA-Seq TPMs (R = 0.92; Figure 5D), suggesting that after probe selection, filtering, and scaling, microarrays can nearly match sequencing technologies in certain measures of biological reproducibility. Comparable improvements can be found if we generate scaling parameters with as few as 16 samples (Additional file 7).
Identifying differentially expressed genes in neocortex using microarray and RNA-Seq
We have previously shown that differences in transcriptional patterns of distinct neocortical areas depend on the distance between these areas, although comparatively few genes show very high levels of differential expression in the neocortex . To assess the extent to which these more subtle expression relationships can be found using RNA-Seq as compared with microarray, we performed ANOVA on all samples from these 22 neocortical areas. RNA-Seq identified 3458 genes differentially expressed (p < 0.05, Bonferroni corrected), compared with 2144 identified using microarray intensities, of which 1121 agree between methods (p ~ 0). An additional 194 genes were identified using SSMIs instead of intensities, including 120 found by RNA-Seq that were not initially identified by microarray, showing that RNA-Seq scaling improved the sensitivity of microarrays to detect differential expression by approximately 10%, even when comparing relatively similar tissue. Thus, despite the relatively mild transcriptional differences between neocortical areas, we find common differentially expressed genes between methods. Collectively, these results demonstrate the biological reproducibility and applicability of both of these genome-wide transcriptional methodologies.
Optimizing gene expression measurements from extant microarray data
We presented an extensive comparison between Agilent microarray and RNA-Seq using 231 samples from the Allen Human Brain Atlas . Our analysis aimed to assess and potentially improve the quality of extant microarray data, and had three primary goals. First, we showed that Agilent microarrays generate highly reproducible expression measurements for many genes, both when comparing gene expression measurements with RNA-Seq as well as between individual brains. Our results are in line with previous studies comparing RNA-Seq and several different microarray platforms [1–5], in that we identified high correlations of absolute (R = 0.78) and differential (R = 0.78) expression levels between methods. We also demonstrated a high degree of biological reproducibility for genes with at least moderate expression (TPM > 1, log2[intensity] > 5), which progressively decreases for lower expression levels using both methods. Second, we have shown that RNA-Seq can be used to assess microarray probe quality, and that this can be done using a relatively small number of experimental samples. We saw progressively increasing biological reproducibility of gene expression measurements when we initially used these quality scores to identify the best microarray probe for each gene, and subsequently excluded genes from the analysis with no reliable probes.
Third, and most importantly, our study proposes for the first time the use of high throughput sequencing to scale microarray probe intensities to more closely reflect absolute gene expression levels. Previous studies have presented strategies for measuring absolute expression using microarray, for example, by co-hybridizing biological samples with calibrated reference samples on spotted-glass microarrays , and have sought to improve array quality by filtering out bad probes from Affymetrix probe sets . The advantage of our strategy is that these scaling parameters appear to be broadly applicable, as those derived from samples run in adult human brain improved the reliability of expression levels identified using the same array in two different hESC lines. In principal, these parameters (Additional file 8) could now be applied to any sample run using our custom array (or the widely-used Agilent 44K Whole Human Genome Microarray) to improve absolute expression quantification for around 80% of genes without needing to perform any additional sequencing. While these scaling parameters may not be optimal for improving data derived from other experimental conditions, the key point is that they can improve the data without the need for performing additional sequencing.
Microarrays and RNA-Seq both currently used in research and clinical settings
While RNA-Seq will likely replace microarrays in research and clinical settings in the near future due to the improved dynamic range and potential for finding novel transcripts and sequence level variations, microarray data sets are still highly valuable, and application of this method improves their interpretability. Currently, microarray data from thousands of experiments are available in public databases such as GEO  and ArrayExpress , providing valuable resources for cross-study comparisons between experiments utilizing the same transcriptional method. For example, both the Allen Human Brain Atlas [14, 21] and a companion BrainSpan atlas targeting prenatal human brain (http://www.brainspan.org) utilize the same Agilent array to facilitate between age comparisons, and have provided insight into the structure and function of the human brain. Furthermore, many research and clinical laboratories already have standard operating procedures in place for analyzing microarray data—including the required machinery, storage space, analysis tools, and expertise—which could be augmented in a relatively straightforward manner using our method, and it will take time to transition to sequencing-based strategies. Finally, until quite recently RNA-Seq techniques required more total RNA (100 ng-1 μg) than microarray [4, 36]. In our study this requirement limited the samples that could be included in the experimental design. However, newer RNA-Seq strategies that allow transcriptional profiling from single cells [37, 38] or even single nuclei  hold great promise in categorizing and understanding cortical cell types, and at potentially a fraction of the cost of microarrays. Thus in the near term, microarrays and RNA-Seq will both likely be used for high throughout gene expression analysis, and therefore any strategies for improving the accuracy of detecting and corroborating gene expression signal from microarrays will be helpful.
Limitations and suggested methodological improvements
One limitation of this analysis is that, in order to accurately assess probe quality and define scaling parameters, the variability across samples must be accurately measured. For example, we found that probes targeting genes with very low expression in brain were much more likely to be failed, compared with high expressers, and that such genes also showed less consistent expression patterns between brains. It is likely that with a more diverse tissue panel some of these low-expressing probes would be assessed as higher quality. Another possible limitation is that our scaling parameters derived from brain may not be applicable to other tissues for genes showing a high degree of differential isoform expression. Again, by using a more diverse tissue panel to calculate scaling parameters, we would expect that probes for such genes would not show consistent expression between methods, and would therefore be failed at the quality control step.
Using our current study as a starting point, we propose a methodology that would address these limitations, and could further be used to improve microarray data quality for many publicly available data sets and clinical applications. First, an RNA atlas of gene expression from several highly distinct tissues, organs, and cells lines would be collected, for example by contacting an accredited tissue bank, or in partnership with a related governmental program such as the Genotype-Tissue Expression (GTEx) Project (http://commonfund.nih.gov/GTEx/index). Our results suggest that approximately 16 carefully chosen samples would be sufficient, although increasing the number of samples would moderately improve the power to detect passing probes (Additional file 7). Previously published gene expression atlases on 46  and later 79  such tissues found that nearly 90% of expressed genes also showed some degree of differential expression, and these atlases could be used as filters for determining the most transcriptionally distinct tissues. Second, RNA from all of these tissues would be processed using RNA-Seq and several of the most commonly used microarray platforms in parallel. Considering only the most widely published array platforms for Affymetrix (HG-U133_Plus_2), Illumina (HumanHT-12 V3.0), and Agilent (014850 Whole Human Genome Microarray 4x44K G4112F), data for around 100,000 microarrays are currently hosted by Gene Expression Omnibus . Additional RNA aliquots could be stored for later processing using other microarray platforms. Third, quality assessments and scaling parameters for each probe (or in the case of Affymetrix, each probe set) of each microarray platform would be assessed as described above. Finally, the resulting values could be applied, in principle, to any data utilizing any of the microarray platforms included in this experiment. We expect that our strategy could also facilitate direct comparisons of data collected using different array platforms, although this hypothesis would need to be tested.
We showed that both Agilent microarrays and RNA-Seq can provide highly reproducible measurements of gene expression in the human brain. Furthermore, for a majority of genes, we were able to quantifiably assess the reproducibility of microarray probes, remove probes with off-target effects, and scale probe intensities to provide highly reliable, quantitative estimates of absolute gene expression levels. The scaling parameters generated using brain tissue appear to be applicable to other tissues, and are provided as a resource to the community. Overall, we calculated SSMIs values for 80% of the approximately 19,000 genes included in our between-method comparison with moderate confidence, and expect that many of the remaining genes could be scaled using a more diverse set of tissues, as proposed.
Post-mortem tissue acquisition and sample processing
Methods for post-mortem tissue acquisition and sample processing have previously been described ( and http://help.brain-map.org/display/humanbrain/Documentation). In short, tissue for the Allen Human Brain Atlas was provided by the NICHD Brain and Tissue Bank (Baltimore, MD) and the University of California Irvine Department of Psychiatry and Human Behavior Brain Donor Program (Irvine, CA), under approvals by Institutional Review Boards of the Maryland Department of Health and Hygiene and University of Maryland Baltimore, or the University of California Irvine, respectively, and with consent from next-of-kin. Specimens for microarray and RNA-Seq profiling were 24-year-old (Brain 1) and 39-year-old (Brain 2) African American males. Total RNA from 120 macrodissected samples initially processed for microarray from each brain (as described in ) were also processed for RNA-Seq. These samples included biological replicates from left and right hemisphere in 29 brain regions, as well as four technical replicates per brain. Aliquots of the same total RNA isolates generated from macro dissections for microarray were used for sequencing. RNA was sent to Expression Analysis Inc. (EA; Durham, NC) for library preparation and sequencing, of which 250 ng total RNA was input for each run. EA used the Illumina TruSeq library preparation protocol and performed paired-end, 50 bp sequencing on an Illumina HiSeq2000 instrument. The sequencing was run as 6-plex with target of 30 million reads per sample: 25-35 million reads per sample were generated. Processed data is available at the Allen Brain Atlas data portal (http://www.brain-map.org) in the form of gene counts, as well as TPM values.
RNA-Seq alignment and data normalization
Sequences were aligned to the genome using RNA-Seq by Expectation-Maximization (RSEM)  (see Additional file 3). Transcripts (isoforms) were defined using the knownGene table from UCSC Genome Browser (; http://genome.ucsc.edu; hg19, Feb. 2009). Summary expression levels for each gene were calculated in terms of both counts and TPM using this pipeline. Mapped read files were also converted to BAM file format for visualization using GenomeBrowse (Golden Helix, Bozeman, MT).
Microarray data normalization was performed as described on the Allen Human Brain Atlas data portal (http://help.brain-map.org/display/humanbrain/Documentation). In short, data is preprocessed for systematic biases, and quantile normalized to the 75th percentile in each batch. Across batches within each brain, data is normalized by aligning two sets of control samples included in all batches and capturing method related bias is adjusted by a modified quantile normalization method. We note that approximately 2,200 probes failed quality assessments during generation of the Allen Human Brain Atlas and were also excluded from this analysis. Across multiple brains non-biological difference is adjusted again by aligning two sets of control samples. RNA-Seq data was TbT normalized  in linear space, as described in the Results, with the differential expression vector defined as TRUE if a sample was from neocortex and FALSE otherwise. Samples were then scaled such that the total log2(TPM) remained unchanged after normalization. Data from both microarray and RNA-Seq were then ComBat normalized  in log2 space using a parametric prior, with the batches corresponding to brain of origin. Unless otherwise noted, ComBat normalized data were used for all comparisons.
Comparisons between microarray and RNA-Seq
The open source R software (http://www.r-project.org/) was extensively used for all analyses and visualizations (Additional file 3). Hierarchical clustering, bar plots showing expression levels, and MDS in two dimensions were strategies used for assessing data quality and evaluating the effect of normalization (Additional file 3). For all between-method comparisons, unless otherwise noted, a single microarray probe with the highest average expression level across regions  was selected to represent each of the approximately 19,000 commonly identified genes. A gene was defined as present in microarray if called present by the Agilent software, and in RNA-Seq if at least one fragment was aligned to that gene.
Pearson correlations (R) were calculated in several contexts as measures of consistency or reproducibility throughout (Additional file 3). Differential expression was measured in two ways. As a global measure of biological or technical reproducibility, fold changes of 100,000 random genes between randomly selected region pairs was compared between brains or methods, respectively, using Pearson correlation. Alternatively, a p-value for each gene was calculated by running ANOVA across either the 7 non-neocortical or 22 neocortical regions, representing transcriptionally diverse and relatively similar tissues, respectively. To assess biological reproducibility of gene subsets, genes were sorted based on average expression level or gene size, binned into 20 groups, and then the average and standard deviation of the across-region Pearson correlations for all genes in each bin were calculated.
Generation of microarray scaling parameters
SSMIs were calculated by performing a linear scaling on microarray intensities from brain 2 in log2 space. Normalized data prior to ComBat normalization were used for this analysis to preserve samples from brain 1 as independent and since ComBat normalized RNA-Seq data could less closely align to absolute expression levels than do TbT normalized data. For each gene, scaling parameters were found by identifying the 5th and 95th percent quantiles of expression in RNA-Seq and microarray and then linearly shifting microarray expression to match these quantiles with RNA-Seq (see Figure 5A). All probes were scaled, but only the probe for each gene with the highest between-method correlation was included in the final comparison between methods. Scaling confidence was estimated using the significance of correlation (a p-value output by the cor.test R function) and converting to q-values as a measure of false discovery rate using the R function qvalue . Probes with q < 0.1 were scaled, and remaining probes were omitted from the analysis. Finally, scaled array intensities were ComBat normalized as described above for comparison with other results.
Comparison with human hESC lines
RNA from two human pluripotent ESC lines, H1 and H9, or their cortical neuronal progeny derived using directed differentiation, were run on microarray and RNA-Seq (Additional file 3). In short, RNA was generated by lysing cells in RNAeasy buffer (Qiagen) and then following the standard RNAeasy protocol. RNA was then sent to Covance (for microarray) and Expression Analysis (for RNA-Seq). This is in contrast to RNA from brain, which was processed by Cogenics (for microarray) and Covance (for RNA-Seq), as described above. Correlations between methods were calculated both before and after SSMI scaling to demonstrate the effectiveness of these scaling parameters on absolute expression levels in other tissues.
Availability of supporting data
All data presented in this manuscript are available either at the Allen Institute data portal (http://www.brain-map.org) or as part of Additional file 1. Specifically, microarray data (both raw and normalized intensities), as well as sequence data summarized to the gene level (both fragment counts and TPM values) can be downloaded from the Allen Institute data portal by clicking on the "Human Brain" link and then the "Download" link. Additional file 1 contains the microarray and sequencing data for hESC lines as well as annotated code for Figures 2, 3, 4, 5 and 6.
Description of additional data files
The following additional data are available with the online version of this paper. Additional file 1 is a zip file including code and supporting data required to reproduce our analysis in R. Additional file 2 is a supplementary figure showing that RNA-Seq has minimal batch effects. Additional file 3 is a text document that includes Supplementary Methods and Supplementary Figure Legends. Additional file 4 is a supplementary figure showing that normalization improves the quality of RNA-Seq and microarray data. Additional file 5 is a supplementary figure showing that certain highly-expressed microarray probes do not accurately measure gene expression. Additional file 6 is a supplementary figure showing that gene expression reproducibility is dependent on transcript length. Additional file 7 is a supplementary figure showing that quality control and scaling of microarray probes can be done well with as few as 16 matched RNA-Seq samples. Additional file 8 is a supplementary table listing scaling parameters for the Agilent Microarray.
Transcripts per million
Standard error of the mean
Human embryonic stem cell
Sequencing scaled microarray intensity
RNA-Seq by expectation-maximization
The authors wish to thank the Allen Institute founders, Paul G. Allen and Jody Allen, for their vision, encouragement and support. The authors gratefully acknowledge the contributions of Allen Institute staff to the Allen Human Brain Atlas, especially A. Bongaarts and D. Bertagnolli for neuroanatomy expertise and RNA processing, respectively. We thank R. Martinez, A. Nelson, and Y. Wang for H1 and H9 cell maintenance and RNA preparation. WiCell and J.A. Thomson, University of Wisconsin-Madison, provided the original H1 and H9 stem cell lines. Tissue for the Allen Human Brain Atlas was provided by the NICHD Brain and Tissue Bank (H. R. Zielke, Director) and by the University of California Irvine Department of Psychiatry and Human Behavior Brain Donor Program (P. Cartagena, Associate Director) and Functional Genomics Laboratory (M. Vawter, Principal Investigator). The Allen Human Brain Atlas was supported in part by award numbers 1C76HF15069-01-00 and 1C76HF19619-01-00 from the Department of Health and Human Services Health Resources and Services Administration. Its contents are solely the responsibility of the authors and do not necessarily represent the official views of the Department of Health and Human Services.
- Bradford JR, Hey Y, Yates T, Li Y, Pepper SD, Miller CJ: A comparison of massively parallel nucleotide sequencing with oligonucleotide microarrays for global transcription profiling. BMC Genomics. 2010, 11: 282-10.1186/1471-2164-11-282.PubMed CentralPubMedView ArticleGoogle Scholar
- Chen H, Liu Z, Gong S, Wu X, Taylor WL, Williams RW, Matta SG, Sharp BM: Genome-wide gene expression profiling of nucleus accumbens neurons projecting to ventral pallidum using both microarray and transcriptome sequencing. Front Neurosci. 2011, 5: 98-PubMed CentralPubMedView ArticleGoogle Scholar
- Marioni JC, Mason CE, Mane SM, Stephens M, Gilad Y: RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Res. 2008, 18: 1509-1517. 10.1101/gr.079558.108.PubMed CentralPubMedView ArticleGoogle Scholar
- Raghavachari N, Barb J, Yang Y, Liu P, Woodhouse K, Levy D, O'Donnell CJ, Munson PJ, Kato GJ: A systematic comparison and evaluation of high density exon arrays and RNA-seq technology used to unravel the peripheral blood transcriptome of sickle cell disease. BMC Med Genom. 2012, 5: 28-10.1186/1755-8794-5-28.View ArticleGoogle Scholar
- Sirbu A, Kerr G, Crane M, Ruskin HJ: RNA-Seq vs dual- and single-channel microarray data: sensitivity analysis for differential expression and clustering. PLoS ONE. 2012, 7: e50986-10.1371/journal.pone.0050986.PubMed CentralPubMedView ArticleGoogle Scholar
- Irizarry RA, Warren D, Spencer F, Kim IF, Biswal S, Frank BC, Gabrielson E, Garcia JG, Geoghegan J, Germino G, Griffin C, Hilmer SC, Hoffman E, Jedlicka AE, Kawasaki E, Martinez-Murillo F, Morsberger L, Lee H, Petersen D, Quackenbush J, Scott A, Wilson M, Yang Y, Ye SQ, Yu W: Multiple-laboratory comparison of microarray platforms. Nat Methods. 2005, 2: 345-350. 10.1038/nmeth756.PubMedView ArticleGoogle Scholar
- Jarvinen AK, Hautaniemi S, Edgren H, Auvinen P, Saarela J, Kallioniemi OP, Monni O: Are data from different gene expression microarray platforms comparable?. Genomics. 2004, 83: 1164-1168. 10.1016/j.ygeno.2004.01.004.PubMedView ArticleGoogle Scholar
- Parkinson H, Kapushesky M, Kolesnikov N, Rustici G, Shojatalab M, Abeygunawardena N, Berube H, Dylag M, Emam I, Farne A, Holloway E, Lukk M, Malone J, Mani R, Pilicheva E, Rayner TF, Rezwan F, Sharma A, Williams E, Bradley XZ, Adamusiak T, Brandizi M, Burdett T, Coulson R, Krestyaninova M, Kurnosov P, Maguire E, Neogi SG, Rocca-Serra P, Sansone SA: ArrayExpress update–from an archive of functional genomics experiments to the atlas of gene expression. Nucleic Acids Res. 2009, 37: D868-D872. 10.1093/nar/gkn889.PubMed CentralPubMedView ArticleGoogle Scholar
- Edgar R, Domrachev M, Lash AE: Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res. 2002, 30: 207-210. 10.1093/nar/30.1.207.PubMed CentralPubMedView ArticleGoogle Scholar
- Mischel PS, Cloughesy TF, Nelson SF: DNA-microarray analysis of brain cancer: molecular classification for therapy. Nat Rev Neurosci. 2004, 5: 782-792. 10.1038/nrn1518.PubMedView ArticleGoogle Scholar
- Altar A, Vawter M, Ginsberg S: Target identification for CNS diseases by transcriptional profiling. Neuropsychopharmacol Offic Publ Am College Neuropsychopharmacol. 2009, 34: 18-54. 10.1038/npp.2008.172.View ArticleGoogle Scholar
- Schadt EE, Friend SH, Shaywitz DA: A network view of disease and compound screening. Nat Rev Drug Discov. 2009, 8: 286-295. 10.1038/nrd2826.PubMedView ArticleGoogle Scholar
- Blalock E, Geddes J, Chen K, Porter N, Markesbery W, Landfield P: Incipient Alzheimer's disease: microarray correlation analyses reveal major transcriptional and tumor suppressor responses. Proc Natl Acad Sci U S A. 2004, 101: 2173-2178. 10.1073/pnas.0308512100.PubMed CentralPubMedView ArticleGoogle Scholar
- Hawrylycz M, Lein E, Guillozet-Bongaarts A, Shen E, Ng L, Miller J, van de Lagemaat L, Smith K, Ebbert A, Riley Z, Abajian C, Beckmann C, Bernard A, Bertagnolli D, Boe A, Cartagena P, Chakravarty M, Chapin M, Chong J, Dalley R, Daly B, Dang C, Datta S, Dee N, Dolbeare T, Faber V, Feng D, Fowler D, Goldy J, Gregor B: An anatomically comprehensive atlas of the adult human brain transcriptome. Nature. 2012, 489: 391-399. 10.1038/nature11405.PubMed CentralPubMedView ArticleGoogle Scholar
- Johnson M, Kawasawa Y, Mason C, Krsnik Ž, Coppola G, Bogdanović D, Geschwind D, Mane S, State M, Šestan N: Functional and evolutionary insights into human brain development through global transcriptome analysis. Neuron. 2009, 62: 494-509. 10.1016/j.neuron.2009.03.027.PubMed CentralPubMedView ArticleGoogle Scholar
- Oldham M, Horvath S, Geschwind D: Conservation and evolution of gene coexpression networks in human and chimpanzee brains. Proc Natl Acad Sci U S A. 2006, 103: 17973-17978. 10.1073/pnas.0605938103.PubMed CentralPubMedView ArticleGoogle Scholar
- Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science. 1999, 286: 531-537. 10.1126/science.286.5439.531.PubMedView ArticleGoogle Scholar
- Singh D, Febbo PG, Ross K, Jackson DG, Manola J, Ladd C, Tamayo P, Renshaw AA, D'Amico AV, Richie JP, Lander ES, Loda M, Kantoff PW, Golub TR, Sellers WR: Gene expression correlates of clinical prostate cancer behavior. Canc Cell. 2002, 1: 203-209. 10.1016/S1535-6108(02)00030-2.View ArticleGoogle Scholar
- Alon U, Barkai N, Notterman DA, Gish K, Ybarra S, Mack D, Levine AJ: Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc Natl Acad Sci U S A. 1999, 96: 6745-6750. 10.1073/pnas.96.12.6745.PubMed CentralPubMedView ArticleGoogle Scholar
- Geiss GK, Bumgarner RE, Birditt B, Dahl T, Dowidar N, Dunaway DL, Fell HP, Ferree S, George RD, Grogan T, James JJ, Maysuria M, Mitton JD, Oliveri P, Osborn JL, Peng T, Ratcliffe AL, Webster PJ, Davidson EH, Hood L, Dimitrov K: Direct multiplexed measurement of gene expression with color-coded probe pairs. Nat Biotechnol. 2008, 26: 317-325. 10.1038/nbt1385.PubMedView ArticleGoogle Scholar
- Shen EH, Overly CC, Jones AR: The Allen human brain atlas: comprehensive gene expression mapping of the human brain. Trends Neurosci. 2012, 35: 711-714. 10.1016/j.tins.2012.09.005.PubMedView ArticleGoogle Scholar
- Garber M, Grabherr MG, Guttman M, Trapnell C: Computational methods for transcriptome annotation and quantification using RNA-seq. Nat Methods. 2011, 8: 469-477. 10.1038/nmeth.1613.PubMedView ArticleGoogle Scholar
- Trapnell C, Pachter L, Salzberg SL: TopHat: discovering splice junctions with RNA-Seq. Bioinformatics. 2009, 25: 1105-1111. 10.1093/bioinformatics/btp120.PubMed CentralPubMedView ArticleGoogle Scholar
- Habegger L, Sboner A, Gianoulis TA, Rozowsky J, Agarwal A, Snyder M, Gerstein M: RSEQtools: a modular framework to analyze RNA-Seq data using compact, anonymized data summaries. Bioinformatics. 2011, 27: 281-283. 10.1093/bioinformatics/btq643.PubMed CentralPubMedView ArticleGoogle Scholar
- Li B, Ruotti V, Stewart RM, Thomson JA, Dewey CN: RNA-Seq gene expression estimation with read mapping uncertainty. Bioinformatics. 2010, 26: 493-500. 10.1093/bioinformatics/btp692.PubMed CentralPubMedView ArticleGoogle Scholar
- Kadota K, Nishiyama T, Shimizu K: A normalization strategy for comparing tag count data. Algorithms Mol Biol. 2012, 7: 5-10.1186/1748-7188-7-5.PubMed CentralPubMedView ArticleGoogle Scholar
- Johnson WE, Li C, Rabinovic A: Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics. 2007, 8: 118-127. 10.1093/biostatistics/kxj037.PubMedView ArticleGoogle Scholar
- Kogenaru S, Qing Y, Guo Y, Wang N: RNA-seq and microarray complement each other in transcriptome profiling. BMC Genomics. 2012, 13: 629-10.1186/1471-2164-13-629.PubMed CentralPubMedView ArticleGoogle Scholar
- Venet D, Detours V, Bersini H: A measure of the signal-to-noise ratio of microarray samples and studies using gene correlations. PLoS ONE. 2012, 7: e51013-10.1371/journal.pone.0051013.PubMed CentralPubMedView ArticleGoogle Scholar
- Bullard J, Purdom E, Hansen K, Dudoit S: Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinforma. 2010, 11: 94-10.1186/1471-2105-11-94.View ArticleGoogle Scholar
- Jiang L, Schlesinger F, Davis CA, Zhang Y, Li R, Salit M, Gingeras TR, Oliver B: Synthetic spike-in standards for RNA-seq experiments. Genome Res. 2011, 21: 1543-1551. 10.1101/gr.121095.111.PubMed CentralPubMedView ArticleGoogle Scholar
- Miller J, Cai C, Langfelder P, Geschwind D, Kurian S, Salomon D, Horvath S: Strategies for aggregating gene expression data: the collapseRows R function. BMC Bioinform. 2011, 12: 322-10.1186/1471-2105-12-322.View ArticleGoogle Scholar
- Lee CK, Sunkin SM, Kuan C, Thompson CL, Pathak S, Ng L, Lau C, Fischer S, Mortrud M, Slaughterbeck C, Jones A, Lein E, Hawrylycz M: Quantitative methods for genome-scale analysis of in situ hybridization and correlation with microarray data. Genome Biol. 2008, 9: R23-10.1186/gb-2008-9-1-r23.PubMed CentralPubMedView ArticleGoogle Scholar
- Dudley AM, Aach J, Steffen MA, Church GM: Measuring absolute expression with microarrays with a calibrated reference sample and an extended signal intensity range. Proc Natl Acad Sci U S A. 2002, 99: 7554-7559. 10.1073/pnas.112683499.PubMed CentralPubMedView ArticleGoogle Scholar
- Dai M, Wang P, Boyd AD, Kostov G, Athey B, Jones EG, Bunney WE, Myers RM, Speed TP, Akil H, Watson SJ, Meng F: Evolving gene/transcript definitions significantly alter the interpretation of GeneChip data. Nucleic Acids Res. 2005, 33: e175-10.1093/nar/gni179.PubMed CentralPubMedView ArticleGoogle Scholar
- Wilhelm BT, Landry JR: RNA-Seq-quantitative measurement of expression through massively parallel RNA-sequencing. Methods. 2009, 48: 249-257. 10.1016/j.ymeth.2009.03.016.PubMedView ArticleGoogle Scholar
- Hashimshony T, Wagner F, Sher N, Yanai I: CEL-Seq: single-cell RNA-Seq by multiplexed linear amplification. Cell Rep. 2012, 2: 666-673. 10.1016/j.celrep.2012.08.003.PubMedView ArticleGoogle Scholar
- Tang F, Barbacioru C, Nordman E, Li B, Xu N, Bashkirov VI, Lao K, Surani MA: RNA-Seq analysis to capture the transcriptome landscape of a single cell. Nat Protoc. 2010, 5: 516-535. 10.1038/nprot.2009.236.PubMedView ArticleGoogle Scholar
- Grindberg RV, Yee-Greenbaum JL, McConnell MJ, Novotny M, O'Shaughnessy AL, Lambert GM, Arauzo-Bravo MJ, Lee J, Fishman M, Robbins GE, Lin X, Venepally P, Badger JH, Galbraith DW, Gage FH, Lasken RS: RNA-sequencing from single nuclei. Proc Natl Acad Sci U S A. 2013, 110: 19802-19807. 10.1073/pnas.1319700110.PubMed CentralPubMedView ArticleGoogle Scholar
- Su AI, Cooke MP, Ching KA, Hakak Y, Walker JR, Wiltshire T, Orth AP, Vega RG, Sapinoso LM, Moqrich A, Patapoutian A, Hampton GM, Schultz PG, Hogenesch JB: Large-scale analysis of the human and mouse transcriptomes. Proc Natl Acad Sci U S A. 2002, 99: 4465-4470. 10.1073/pnas.012025199.PubMed CentralPubMedView ArticleGoogle Scholar
- Su A, Wiltshire T, Batalov S, Lapp H, Ching K, Block D, Zhang J, Soden R, Hayakawa M, Kreiman G, Cooke M, Walker J, Hogenesch J: A gene atlas of the mouse and human protein-encoding transcriptomes. Proc Natl Acad Sci U S A. 2004, 101: 6062-6067. 10.1073/pnas.0400782101.PubMed CentralPubMedView ArticleGoogle Scholar
- Meyer LR, Zweig AS, Hinrichs AS, Karolchik D, Kuhn RM, Wong M, Sloan CA, Rosenbloom KR, Roe G, Rhead B, Raney BJ, Pohl A, Malladi VS, Li CH, Lee BT, Learned K, Kirkup V, Hsu F, Heitner S, Harte RA, Haeussler M, Guruvadoo L, Goldman M, Giardine BM, Fujita PA, Dreszer TR, Diekhans M, Cline MS, Clawson H, Barber GP: The UCSC Genome Browser database: extensions and updates 2013. Nucleic Acids Res. 2013, 41: D64-D69. 10.1093/nar/gks1048.PubMed CentralPubMedView ArticleGoogle Scholar
- Storey JD, Tibshirani R: Statistical significance for genomewide studies. Proc Natl Acad Sci U S A. 2003, 100: 9440-9445. 10.1073/pnas.1530509100.PubMed CentralPubMedView ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.