Simple regression for correcting ΔCt bias in RT-qPCR low-density array data normalization
© Cui et al.; licensee BioMed Central. 2015
Received: 28 August 2014
Accepted: 22 January 2015
Published: 14 February 2015
Reverse transcription quantitative PCR (RT-qPCR) is considered the gold standard for quantifying relative gene expression. Normalization of RT-qPCR data is commonly achieved by subtracting the Ct values of the internal reference genes from the Ct values of the target genes to obtain ΔCt. ΔCt values are then used to derive ΔΔCt when compared to a control group or to conduct further statistical analysis.
We examined two rheumatoid arthritis RT-qPCR low density array datasets and found that this normalization method introduces substantial bias due to differences in PCR amplification efficiency among genes. This bias results in undesirable correlations between target genes and reference genes, which affect the estimation of fold changes and the tests for differentially expressed genes. Similar biases were also found in multiple public mRNA and miRNA RT-qPCR array datasets we analysed. We propose to regress the Ct values of the target genes onto those of the reference genes to obtain regression coefficients, which are then used to adjust the reference gene Ct values before calculating ΔCt.
The per-gene regression method effectively removes the ΔCt bias. This method can be applied to both low density RT-qPCR arrays and individual RT-qPCR assays.
Reverse transcription quantitative PCR (RT-qPCR) has long become the gold standard for quantifying relative gene expression to study normal and pathological cell processes. Low density RT-qPCR arrays improve the throughput without losing the benefit of individual PCR reactions [1-3]. Although some data-driven normalization methods, such as quantile  and rank invariant  procedures, have been proposed and applied , the most common practice is based on the endogenous internal references, often referred to as “housekeeping” genes as for individual RT-qPCR experiments. Comparison to reference genes offers multiple practical advantages but the use of this strategy relies on the premise that these genes are expressed at the same level across a number of experimental conditions under investigation. However, no endogenous controls have been found to be constantly expressed across all different tissues, developmental stages, and study conditions [7,8]. Thus, a large number of papers focus on identifying stable references for various organisms, tissues, and conditions [9-16], given the critical nature of the quality of the comparisons and the implications for hypothesis testing of expression levels.
Conventional normalization of RT-qPCR data entails first identifying the appropriate reference genes, then subtracting the Ct (threshold cycle) values of the best reference gene or the Ct mean of several reference genes from all the target genes to obtain the normalized (calibrated) ΔCt for further comparison [17,18]. This type of normalization is based on the assumption that the Ct values of the target genes have a linear relationship with those of the reference genes and that the regression coefficient is 1. In this paper, we show, with RT-qPCR array data collected from rheumatoid arthritis patients, that the relationship is linear but the coefficient is not 1 and varies among different reference genes. Under this circumstance, ΔCt is biased. Using a variety of publicly available datasets, we show that this bias is widespread and not related to the physiologic or pathologic process under analysis. Furthermore, we demonstrate that PCR amplification efficiency varies substantially across genes, which is likely the cause of this bias. Methods have been proposed to take into account of the amplification efficiency in the normalization [19,20]; however, they involve estimating amplification efficiencies of targets and references using dilution series, which is not practical for RT-qPCR arrays. We propose a simple regression method for removing ΔCt bias. This method can be applied not only to RT-qPCR arrays but also assays for individual genes.
ΔCt normalization introduces bias
Regression on reference genes
Similar bias from other mRNA RT-qPCR array datasets
Reference gene regression coefficients from gene expression datasets
Similar bias from microRNA PCR array datasets
Control gene regression coefficients from microRNA datasets
Regression coefficients vary among target genes
Amplification efficiencies differ among genes
Regression coefficients of C t values on dilution factors
Impact on differential expression analyses
We conducted some simple simulation studies to compare the fold change estimates and false/true positive rates between ΔCt normalization and per-gene regression normalization. Our results showed that the per-gene regression normalization increase the precision of fold change estimates (Additional file 1: Figure S2) and the power for detecting differential expressions especially when the regression coefficient is far from 1 and the variation is not too large. The false positive rate of the regression normalization is well controlled around the expected level while that of the ΔCt normalization is inflated when there is a mean Ct difference between the comparing groups for the control gene (Additional file 2: Table S3 and Additional file 2: Table S4). The inflation of false positive rate from ΔCt normalization enlarges along the decrease of target gene Ct variance and the increase of sample size.
Our study showed that even with a universally constant reference gene, the ΔCt method tends to introduce large bias. Although the Ct values of the target genes are positively correlated with the reference gene, the regression coefficients are often substantially different from 1. We believe that a more appropriate method is to estimate the coefficient using regression and then subtract the reference gene Ct values adjusted by the regression coefficient.
Using three target genes and three reference genes as example, we demonstrated that the RT-qPCR amplification efficiencies are different among genes, which results in the deviation of the regression coefficients from 1 for some combinations of target and reference genes. Under ideal conditions, all primers/probes pairs should have amplification efficiency at close to 100% (http://www3.appliedbiosystems.com/cms/groups/mcb_marketing/documents/generaldocuments/cms_040377.pdf). Otherwise, the amplification efficiency should be estimated [23-25] and incorporated into the normalization procedure. Unfortunately, dilution curves or amplification dynamics for estimating the amplification efficiency of each gene is not a pragmatic method in RT-qPCR experiments. Given the cost of low density RT-qPCR arrays, it is even less practical to run dilution curves. Therefore, a simple remedy is to use regression for each target gene in the normalization instead of direct subtraction of Ct values.
Linear regression is a simple and effective way to estimate the normalization coefficients. However, one potential downside is that it can be easily affected by outlier data points. In our analysis, we removed outlier data points before normalization to avoid this problem. An alternative way is to apply a robust regression to combine these two steps together. Attention is also needed when combining RT-qPCR datasets. When individual datasets are normalized separately, the regression coefficients and intercepts can be different. If this is the case, the normalized data based on different regression coefficients will still have potential mean differences, which needs to be adjusted before combining the datasets.
When multiple reference genes are used as controls, they do not always give similar regression coefficients. We showed that using the mean Ct values of all reference genes for regression can achieve most of the normalization goal. However multiple regression analysis does a better job at simultaneously removing all dependency on all reference genes. We have found that the coefficients for some reference genes are fairly large while they are close to zero for others (Additional file 2: Table S2). Therefore, using only the reference genes with large coefficients will usually work well. One downside of multiple regression is that when the sample size of the RT-qPCR experiment is small, for example no bigger than the number of reference genes, the multiple regression will over-fit due to the lack of degree of freedom for residuals. In this situation, the number of reference genes used has to be reduced by selecting the best one or using the average. It is important to point out that multiple regression normalization is less stringent than global mean normalization because it does not force the mean Ct values of all samples to be the same. It only removes the correlation with reference genes.
When regression-based normalization is conducted for low density RT-qPCR array data, there is the choice of using the mean target Ct values of all target genes for a single regression or regression for each target gene on the array. Our results from the RA-SAB dataset showed that the mean regression was just one constant shift from the ΔCt normalization when fold change is concerned. The per-gene regression resulted in more differences due to the regression coefficient differences among genes. The fold changes obtained from the per-gene regression normalization were smaller and p values were larger than those from conventional subtraction normalization. This is likely the result of bias removal. When correlation between normalized target gene Ct and control Ct is introduced by subtraction normalization, fold change has two components, the true fold change between the two comparing groups and the difference related to the mean control difference. For example, even if two groups have equal mean expressions, the two group means of the normalized ΔCt values will still be different when the data points from the two groups are located in different areas in a panel of Figure 2. The size of this difference depends on the slope of regression and the mean difference of the control gene Ct values. Therefore, ΔCt normalization gives larger fold changes, which results in smaller p values. Our simulation results largely confirmed this speculation. Bias related to ΔCt normalization could be one reason for larger fold changes obtained from RT-qPCR than those from other high-throughput technologies, such as microarrays. Given that RT-qPCR has been considered as “gold standard” for quantifying gene expression, the general thoughts about this discrepancy have been that microarray somehow “squashes” the fold changes. Given our findings in this study, an alternative explanation is that RT-qPCR sometimes inflates fold changes due to ΔCt bias. This is consistent with the observations that fold changes from microarray and RNA-Seq have been found to be very similar in some studies [26,27].
One limitation of our regression-based normalization is that it works well when the sample size of the experiment is fairly large, such as our example (n = 60) and the GEO datasets (n ≥ 12). It can be problematic for very small sample sizes, such as just a few. Our simulations showed that the reduction of false positives and gain of power diminishes when total sample size goes down to 10 when variation is large. For RT-qPCR experiments on single or a few genes, dilution series are needed and practical for estimating amplification efficiencies, which can then be taken into account in normalization. For RT-qPCR array experiments with small number of samples, dilution series is less practical due to the cost. In this case, the amplification efficiency can be estimated based on the PCR kinetic curve [24,25]. However, kinetic curves have to be obtained for each gene from the PCR machine, which is not a standard practice of RT-qPCR. If these methods are not applied, investigators need to be aware of the existence of potential bias associated with ΔCt normalization in differential expression. In addition, we recommend use regression-based normalization when a statistically significant correlation between the Ct values of target genes and controls is detected; otherwise, the regression-based normalization is not beneficial.
The ΔCt normalization method often introduces bias due to amplification efficiency differences, which affects the estimation of fold change and the identification of differentially expressed genes. This bias can be effectively corrected by estimating the regression coefficient for each target gene and adjusting their ΔCt values accordingly.
Rheumatoid arthritis datasets
Two low-density PCR arrays were used to generate gene expression data from peripheral blood cells of patients with rheumatoid arthritis (RA). The first array was the Innate and Adaptive Immune Responses PCR Array from SABiosciences (Frederick, MD), which has 84 genes involved in the host response to bacterial infection and sepsis with 5 reference genes (http://www.sabiosciences.com/rt_pcr_product/HTML/PAHS-052A.html). The second array was the TaqMan Human Immune Array from Applied Biosystems (Foster City, CA), which contains 90 genes involved in stress response, signal transduction, cytokines/receptors, cell surface receptors, oxidoreductase, chemokines, protease, and cell cycle. Six reference genes are included as internal controls (http://tools.lifetechnologies.com/content/sfs/brochures/cms_042394.pdf). 60 RNA samples from peripheral blood (collected in PAXGene tubes) were studied from 40 African-Americans with RA and 20 African-American healthy controls. All patients and controls were from the CLEAR (Consortium for the Longitudinal Evaluation of African-Americans with Rheumatoid Arthritis) Registry [28,29] (http://www.uab.edu/medicine/rheumatology/research/70-clear). This study was approved by the Institutional Review Board (IRB) for Human Use of the University of Alabama at Birmingham (UAB IRB Human Subjects Protocol # X080219016). All participants and controls signed informed consent forms, and all human subject research was in compliance with the Helsinki Declaration. Standard protocols recommended by the manufacturer were used for preparing cDNA, PCR amplification, and quantification. PCR amplification was conducted using the Applied Biosystems Prism 7900HT sequence detection system.
Public datasets from Gene Expression Omnibus (GEO)
We selected RT-qPCR array datasets that have large or moderate sample size. Raw data were downloaded for each dataset (Additional file 2: Table S1). The median Ct value from the technical replicates of each gene was used for further analyses.
PCR amplification efficiency experiment
We selected six genes with three reference genes (GAPDH, RPS9, and RPL13A) and three target genes (IFNGR1, IRF1, and LY96). TaqMan® Assays recommended by the manufacture as most efficient for quantifying gene expression from each gene were purchased from Life Technologies (Grand Island, NY). Seven concentrations (1/16, 1/8, 1/4, ½, 1, 2, and 4 fold of the original cDNA concentration) were used for examining amplification efficiency. The reactions were performed on an Applied Biosystems QuantStudioTM 6 Flex Real-Time PCR System (384-well, 15 uL reaction volume/well). Three technical replicates were conducted for each sample and all samples were on the same plate.
RT-qPCR data filtering: The median was used to summarize the three technical replicates from the same sample. For the RA data, we filtered out the Ct values that were equal to 40 (undetectable) for downstream analysis. For GEO datasets, Ct values of 35, 39, 40, or “undetermined” were filtered out depending on the truncation value of the dataset for non-detection. Outlier Ct values were identified for each reference gene as more than 1.5 times of inter quartile range beyond the first and third quartiles. Samples with more than one reference genes deemed as outliers were removed from calculating the regression coefficients to avoid outlier effect. For testing of differential expression between two groups we used the Wilcoxon Rank Sum test after filtering the undetectable samples. The differences of group means were used to represent the fold changes for comparison of normalization methods. The data organization were coducted using Microsoft Excel. Statistical tests and plots for figures were conducted in R (version 3.0.0). All plots were generated using the plot function in R.
Proposed regression-based normalization
After removing undetectable target genes, follow three simple steps for each target gene. 1) Remove samples with outlier control gene expressions. 2) Regress target gene Ct values onto a control gene Ct values to obtain regression coefficient (b) and test for its significance; 3) If b is significant, conduct the normalization as Ct_target – b x Ct_control. When there are multiple control genes and a large enough sample size, conduct multiple regression with all control genes in the model as dependant variables to estimate their regression coefficients. Perform normalization by subtracting all control gene Ct values multiplied by their corresponding regression coefficients.
The RA-qPCR array data associated with this study have been submitted to NCBI GEO with accession number GSE64708.
The authors thank the CLEAR investigators for their collaboration. CLEAR investigators include: S. Louis Bridges, Jr., MD, PhD, and George Howard DrPH (University of Alabama at Birmingham); Doyt L. Conn, MD (Emory University); Beth L. Jonas, MD and Leigh F. Callahan, PhD (University of North Carolina); Ed A. Smith, MD (Medical University of South Carolina); Richard D. Brasington, MD (Washington University); Larry W. Moreland, MD (University of Pittsburgh); Ted R. Mikuls, MD, MSPH (University of Nebraska Medical Center). The authors thank Elizabeth Perkins for technical assistance and Peter K. Gregersen (The Feinstein Institute for Medical Research) for valuable comments on the manuscript. This work is Supported by National Institutes of Health N01 AR-6-2278, P60 AR048095, P60 AR064172, and R01 AR057202 (SL Bridges, Jr.), K23 AR062100 (MI Danila), and K01 AR060848 (RJ Reynolds). Funding for open access charge: National Institutes of Health.
- Osman F, Leutenegger C, Golino D, Rowhani A. Comparison of low-density arrays, RT-PCR and real-time TaqMan RT-PCR in detection of grapevine viruses. J Virol Methods. 2008;149:292–9.View ArticlePubMedGoogle Scholar
- Arany ZP. High-throughput quantitative real-time PCR. Curr Protoc Hum Genet. 2008;Chapter 11:Unit 11.10.PubMedGoogle Scholar
- VanGuilder HD, Vrana KE, Freeman WM. Twenty-five years of quantitative PCR for gene expression analysis. Biotechniques. 2008;44:619–26.View ArticlePubMedGoogle Scholar
- Bolstad BM, Irizarry RA, Astrand M, Speed TP. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics. 2003;19:185–93.View ArticlePubMedGoogle Scholar
- Li C, Wong WH. Model-based analysis of oligonucleotide arrays: expression index computation and outlier detection. Proc Natl Acad Sci USA. 2001;98(1):31–6.View ArticlePubMed CentralPubMedGoogle Scholar
- Mar JC, Kimura Y, Schroder K, Irvine KM, Hayashizaki Y, Suzuki H, et al. Data-driven normalization strategies for high-throughput quantitative RT-PCR. BMC Bioinformatics. 2009;10:110.View ArticlePubMed CentralPubMedGoogle Scholar
- Blanquicett C, Johnson MR, Heslin M, Diasio RB. Housekeeping gene variability in normal and carcinomatous colorectal and liver tissues: applications in pharmacogenomic gene expression studies. AnalBiochem. 2002;303(2):209–14.Google Scholar
- Huggett J, Dheda K, Bustin S, Zumla A. Real-time RT-PCR normalisation; strategies and considerations. Genes Immun. 2005;6(4):279–84.View ArticlePubMedGoogle Scholar
- Cui X, Zhou J, Qiu J, Johnson MR, Mrug M. Validation of endogenous internal real-time PCR controls in renal tissues. AmJ Nephrol. 2009;30(5):413–7.View ArticleGoogle Scholar
- Hruz T, Wyss M, Docquier M, Pfaffl MW, Masanetz S, Borghi L, et al. RefGenes: identification of reliable and condition specific reference genes for RT-qPCR data normalization. BMC Genomics. 2011;12:156.View ArticlePubMed CentralPubMedGoogle Scholar
- Li YP, Bang DD, Handberg KJ, Jorgensen PH, Zhang MF. Evaluation of the suitability of six host genes as internal control in real-time RT-PCR assays in chicken embryo cell cultures infected with infectious bursal disease virus. Vet Microbiol. 2005;110(3-4):155–65.View ArticlePubMedGoogle Scholar
- De Boever S, Vangestel C, De Backer P, Croubels S, Sys SU. Identification and validation of housekeeping genes as internal control for gene expression in an intravenous LPS inflammation model in chickens. Vet Immunol Immunopathol. 2008;122(3-4):312–7.View ArticlePubMedGoogle Scholar
- Fu J, Bian L, Zhao L, Dong Z, Gao X, Luan H, et al. Identification of genes for normalization of quantitative real-time PCR data in ovarian tissues. 2010. p. 1–7.Google Scholar
- Hong Cai J, Deng S, Kumpf S, Lee P, Zagouras P, Ryan A, et al. Validation of rat reference genes for improved quantitative gene expression analysis using low density arrays. Biotechniques. 2007;42:503–12.View ArticleGoogle Scholar
- Chervoneva I, Li Y, Schulz S, Croker S, Wilson C, Waldman SA, et al. Selection of optimal reference genes for normalization in quantitative RT-PCR. BMC Bioinformatics. 2010;11:253.View ArticlePubMed CentralPubMedGoogle Scholar
- Marum L, Miguel A, Ricardo CP, Miguel C. Reference gene selection for quantitative real-time PCR normalization in Quercus suber. PLoS One. 2012;7:e35113.View ArticlePubMed CentralPubMedGoogle Scholar
- Pfaffl MW, Tichopad A, Prgomet C, Neuvians TP. Determination of stable housekeeping genes, differentially regulated target genes and sample integrity: BestKeeper--excel-based tool using pair-wise correlations. Biotechnol Lett. 2004;26(6):509–15.View ArticlePubMedGoogle Scholar
- Vandesompele J, De Preter K, Pattyn F, Poppe B, Van Roy N, De Paepe A, et al. Accurate normalization of real-time quantitative RT-PCR data by geometric averaging of multiple internal control genes. Genome Biol. 2002;3(7):RESEARCH0034.View ArticlePubMed CentralPubMedGoogle Scholar
- Pfaffl MW. A new mathematical model for relative quantification in real-time RT-PCR. Nucleic Acids Res. 2001;29:e45.View ArticlePubMed CentralPubMedGoogle Scholar
- Hellemans J, Mortier G, De Paepe A, Speleman F, Vandesompele J. qBase relative quantification framework and software for management and automated analysis of real-time quantitative PCR data. Genome Biol. 2007;8:R19.View ArticlePubMed CentralPubMedGoogle Scholar
- Mestdagh P, Van Vlierberghe P, De Weer A, Muth D, Westermann F, Speleman F, et al. A novel and universal method for microRNA RT-qPCR data normalization. Genome Biol. 2009;10:R64.View ArticlePubMed CentralPubMedGoogle Scholar
- Meyer SU, Kaiser S, Wagner C, Thirion C, Pfaffl MW. Profound effect of profiling platform and normalization strategy on detection of differentially expressed microRNAs–a comparative study. PLoS One. 2012;7:e38946.View ArticlePubMed CentralPubMedGoogle Scholar
- Livak KJ, Schmittgen TD. Analysis of relative gene expression data using real-time quantitative PCR and the 2(−Delta Delta C(T)) Method. Methods. 2001;25:402–8.View ArticlePubMedGoogle Scholar
- Liu W, Saint DA. A new quantitative method of real time reverse transcription polymerase chain reaction assay based on simulation of polymerase chain reaction kinetics. Anal Biochem. 2002;302:52–9.View ArticlePubMedGoogle Scholar
- Tichopad A. Standardized determination of real-time PCR efficiency from a single reaction set-up. Nucleic Acids Res. 2003;31:122e–122.View ArticleGoogle Scholar
- Guo Y, Sheng Q, Li J, Ye F, Samuels DC, Shyr Y. Large scale comparison of gene expression levels by microarrays and RNAseq using TCGA data. PLoS One. 2013;8:e71462.View ArticlePubMed CentralPubMedGoogle Scholar
- Marioni JC, Mason CE, Mane SM, Stephens M, Gilad Y. RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Res. 2008;18(9):1509–17.View ArticlePubMed CentralPubMedGoogle Scholar
- Bridges SL, Causey ZL, Burgos PI, Huynh BQN, Hughes LB, Danila MI, et al. Radiographic severity of rheumatoid arthritis in African Americans: results from a multicenter observational study. Arthritis Care Res (Hoboken). 2010;62:624–31.View ArticleGoogle Scholar
- Reynolds RJ, Cui X, Vaughan LK, Redden DT, Causey Z, Perkins E, et al. Gene expression patterns in peripheral blood cells associated with radiographic severity in African Americans with early rheumatoid arthritis. Rheumatol Int. 2013;33:129–37.View ArticlePubMed CentralPubMedGoogle Scholar
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.