Adjustment method for microarray data generated using two-cycle RNA labeling protocol
© Wang et al.; licensee BioMed Central Ltd. 2013
Received: 18 May 2012
Accepted: 26 December 2012
Published: 16 January 2013
Skip to main content
© Wang et al.; licensee BioMed Central Ltd. 2013
Received: 18 May 2012
Accepted: 26 December 2012
Published: 16 January 2013
Microarray technology is widely utilized for monitoring the expression changes of thousands of genes simultaneously. However, the requirement of relatively large amount of RNA for labeling and hybridization makes it difficult to perform microarray experiments with limited biological materials, thus leads to the development of many methods for preparing and amplifying mRNA. It is addressed that amplification methods usually bring bias, which may strongly hamper the following interpretation of the results. A big challenge is how to correct for the bias before further analysis.
In this article, we observed the bias in rice gene expression microarray data generated with the Affymetrix one-cycle, two-cycle RNA labeling protocols, followed by validation with Real Time PCR. Based on these data, we proposed a statistical framework to model the processes of mRNA two-cycle linear amplification, and established a linear model for probe level correction. Maximum Likelihood Estimation (MLE) was applied to perform robust estimation of the Retaining Rate for each probe. After bias correction, some known pre-processing methods, such as PDNN, could be combined to finish preprocessing. Then, we evaluated our model and the results suggest that our model can effectively increase the quality of the microarray raw data: (i) Decrease the Coefficient of Variation for PM intensities of probe sets; (ii) Distinguish the microarray samples of five stages for rice stamen development more clearly; (iii) Improve the correlation coefficients among stamen microarray samples. We also discussed the necessity of model adjustment by comparing with another simple adjustment method.
We conclude that the adjustment model is necessary and could effectively increase the quality of estimation for gene expression from the microarray raw data.
Gene expression microarrays are widely utilized for transcriptome analysis of biological samples from different treatments or different phenotypic groups. However, due to the difficulty of extracting sufficient amount of starting mRNA or total RNA from biological materials, such as rice stamen, it is usually impossible to detect the amounts and sequences of mRNA directly. Therefore, amplification of mRNA sample is necessary before performing microarray experiment to detect the fluorescence signals of mRNA [1–4]. RNA linear amplification technology, based on T7 RNA polymerase and in vitro transcription (IVT) (Affymetrix, Santa Clara, CA, USA), gradually becomes a mostly used protocol for target preparation in microarray experiments , mainly for three reasons. First, it reduces the required amount of starting materials to 1 ∼100 ng of total RNA. Second, the bias from the amplification is smaller than typical PCR for DNA. Third, it only requires mRNA or total RNA, rather than DNA, hence leads to a wider application .
It is also notable that, for mRNA transcripts, the decreasing trend of the two-cycle labeling is heavier than that of the one-cycle labeling, which implies that we are probably misled. Three possible reasons may account for this phenomenon. The first one is the degradation of transcript due to RNA’s instability, especially in 16 hour’s in vitro transcription (IVT) stage during the amplification process. Another reason might be the usage of random primers in the start of the second round amplification, which concludes uncompleted reverse transcription from aRNA to cDNA. The third reason may be the saturation of probe spot on microarray chip. Therefore, it is necessary to revise the microarray data generated with two-cycle RNA amplification before using it to perform further analysis, such as detecting differentially expressed genes, constructing co-expression gene network and so forth.
A few works before have also mentioned the problem of RNA degradation during two-cycle labeling [7–12]. However, there is almost no universally applicable solutions reported to deal with this problem. In this study, we proposed a statistical framework to model the process of mRNA two-cycle linear amplification, and established a linear model to revise the expression intensity at probe level. Probe level correction in this study means bias correction for intensity of Perfect Match probe (PM) if there is no special instruction.
This paper proceeds as follows. In the second section, we described three types of microarray experiments and a Real Time PCR experiment to validate the degradation effect. After that, we proposed a probabilistic model for probe level adjustment, and the parameters are estimated by MLE. In the third section, we displayed the bias existed in microarray experiment and validated the bias by Real Time PCR. Then we discussed the effects that might bring bias, especially the degradation effect in the RNA two-cycle labeling protocol. By applying our model to two stamen microarray data sets, we showed that our model adjustment obtained a significant improvement for the quality of microarray raw data. Finally, we discussed the necessity of our model adjustment and a possible application to RNA-seq data in the last section.
In this chapter, we first briefly introduced three kinds of microarray data of rice, and then described the designed Real Time PCR experiment to revalidate transcriptional degradation. For brevity, we refer to these three data sets as Data Set 1, Data Set 2 and Data Set 3, respectively. Plant materials as well as sample collection and total RNA isolation method are the same in three data sets. Data Set 1 was designed to show the phenomena of RNA degradation and to estimate parameters in the model. Data Set 2 was designed to validate the reproducibility of the bias from RNA degradation and to determine whether the bias is sample specific. Data Set 3 was applied to validate the efficiency of our adjusting model. After that we established a probabilistic model to estimate the extent of bias caused by RNA degradation during two-cycle linear amplification and attempted to correct for this bias. As mentioned above, the reason why bias is introduced is mainly due to the using of random primers and RNA degradation, on which our model is based. The saturation effect of micoarray chip is beyond our consideration.
Rice (O. sativa L. ssp. japonica cv. Zhonghua 15) seeds were soaked in water at 30°C for about one week. And then seedlings were transplanted in plastic pots and cultured at 30 ±2°C under 11 hours day/13 hours night cycle.
The rice embryos were obtained from the seeds soaked in water for 2-3 days. The third mature expanded leaves were collected when they had just fully expended. Rice seedlings were harvested when they were 2-3 cm in height. The panicles were collected when they were 0.6-4 cm in length. Total RNA was isolated separately from rice embryos, the third mature leaves, the fifth mature leaves, seedlings and panicles with TRIzol Reagent from Invitrogen Life Technologies.
To obtain microarray data covering almost all genes in rice, total RNA from different samples described above was mixed into a big pool called Pre-amplified mRNA sample (PAM) by almost equal magnitude. Eight subgroups were separated from the purified PAM. The half of them were amplified and labeled by the One-Cycle Eukaryotic Target Labeling Assay from Affymetrix (Santa Clara, CA, USA). The resulting samples were called One-Cycle cRNA samples (OCS). The other half were amplified and labeled by the Two-Cycle Eukaryotic Target Labeling Assay from Affymetrix (Santa Clara, CA, USA). The resulting samples were called Two-Cycle cRNA samples (TCS). All experimental procedures strictly followed instructions specified in the Affymetrix GeneChip Expression Analysis Technical Manual.
This data set contained microarray data from two different samples, including the third leaf primordium and the third mature leaves of rice. Total RNA was extracted from the two samples. And then each of them was separately amplified by two methods, the One-Cycle Labeling Assay and the Two-Cycle Labeling Assay (Affymetrix, Santa Clara, CA, USA). Two replicates were applied for each Assay. Totally there are 8 hybridization samples for the following microarray experiments.
This data set included 21 slides of Affymetrix microarray. There are 7 different stages from rice, including the third leaf primordium, the third mature leaves, and stamen samples at stage 2, 3, 4, 5 and 6 (Stages are partitioned by the development of Stamen) [13, 14]. Biological replicates are applied 3 times for each of them. All the 21 mRNA samples are amplified with Two-Cycle Labeling Assay (Affymetrix, Santa Clara, CA, USA).
All amplified RNA samples were hybridized with the Affymetrix GeneChip Rice Genome oligonucleotide arrays, which is widely used for rice gene expression analysis. The.CEL files are available at http://www.math.pku.edu.cn/teachers/dengmh/RiceTCAC/index.html.
To validate the degradation trend mentioned above, we developed a strategy by applying Real Time PCR to several transcripts. Real Time PCR were carried out using the same RNA samples as that of Data Set 1, including Pre-amplified mRNA samples (PAM), One-Cycle cRNA samples (OCS), and Two-Cycle cRNA samples (TCS).
In brief, primers were designed for transcripts using Real Time PCR Primer Design tool (http://www.genscript.com/cgi-bin/tools/primer_genscript.cgi?op=standard). Each primer was further confirmed by dissociation curve analysis after the PCR reactions. First strand cDNA was synthesized by reverse transcription using 1 μg of total RNA in 20 μl of reaction volume using SuperScriptTM III Reverse Transcriptase from Invitrogen Corporation. Diluted cDNA samples were used for Real Time PCR analysis with 100 nM of each primer mixed with SYBR Green PCR master following manufacturer’s instructions. The reactions were carried out in Optical 96-Well Fast Plate on the 7500 Fast System (Applied Biosystems, USA).
D P i j is a number between 0 and 1, where 0 means no degradation effect and 1 indicates the amplicon is completely degraded. The closer D P i j is to 0, the less the region of ith amplicon is affected by amplification in jth sample, and vice versa.
To better understand the model, we introduce the notation. To simplify calculation, we first assume that RNA is degrading at an even speed, regardless the sequence effect, e.g. GC content. Let 0 and L (base pair, bp) be the positions of 3’ end and 5’ end of a transcript before amplification, where L is the length of the transcript. Let A i and B i be the position of the new 3’ and 5’ end after the ith shorten (i = 1,2,3). It’s obvious that A 0 = 0, B 0 = L. (i) After the first cycle amplification, based on the uniform assumption of RNA degradation, we further assume A 1 ~ U[A 0,A 0 + a] and B 1 ~ U[B 0 − b,B 0], where a,b are the degradation limits of 3’ end and 5’ end separately. (ii) In the second cycle of amplification, 3’ and 5’ ends varies more complicated. As random primer is used in the step of the first strand cDNA synthesis from antisense RNA produced in the first cycle of amplification, which may result in incomplete synthesis. 5’ end is decreased while 3’ end remains the same. Therefore, the position of new 3’ end A 2 is the same as A 1, while the position of new 5’ end B 2 ~ U[A 1,B 1], as random primer can bind to any position of the cRNA. (iii) As the same as in the first cycle, degradation takes place at both 3’ end and 5’ end in the IVT steps of the second cycle amplification. So it’s similarly concluded that A 3 ~ U[A 2,A 2 + a] and B 3 ~ U[B 2 − b,B 2], where a,b are the same as in the first cycle.
But for the second cycle amplification, the Retaining Rate function p 2(z) is complicated piecewise, associated with F 3(x,y). We calculated it by integrating with R software.
For determined j, k and all i, , where is unknown. Then least square estimation becomes maximum likelihood estimation.
where i = 1,2,…n j , n j is the number of probes in the jth probe set.
Although the above model only uses PM of probe for training the parameters, it can also be used to adjust MM intensity, as Retaining Rate represents the probability of the nucleotide of each position on transcript to be remained after two-cycle amplification. As our model only correct for the bias caused by degradation during amplification at probe level, the corrected intensity didn’t perform the background correction or normalization at probe set level. We think it necessary to perform a further pre-processing to adjust for bias introduced by cross-hybridization or other factors. After adjusting for PM and MM, we could apply many existing popular normalization methods, such as PDNN , dchip  and RMA  to preprocess the modified microarray data before further investigation. In this study, we choose PDNN as it is more powerful by considering the sequence binding information [18, 19].
In the beginning of the first cycle amplification, a transcript is complete poly(A) RNA, whose cap in 5’ ends and tail in 3’ ends to some extent protect the RNA sequence. Thus, degradation of the transcript is relatively slow in this process. However, when the second cycle begins, as the RNA cap in 5’ end and tail in 3’ end have degraded or almost degraded, the degrading effect becomes heavier and thus leads the RNA sequence to be shorter . Besides, using random primer also leads the RNA sequence to become shortened directly. As a result, it implies that the key effect of degradation may hide in the second cycle. Combining with the protocol of the Two-Cycle Labeling Assay (Affymetrix, Santa Clara, CA, USA), we postulated that there may be three possible reasons for the bias listed as follows. The first effect is RNA degradation in amplification process. With regard to RNA’s instability comparing with DNA, degradation would probably take place in the tube of RNA amplification, even in the initial RNA period which the degraded loss would amplify in the end. Second, random primers used in the second cycle have dubitable uncertainties that, unlike the oligo (dT) primer, can bind not only to 3’ poly(A) end of RNA, but also to the middle or the other end of transcript, concluding cDNA’s uncompleted reverse transcription from amplified RNA, which leads greater bias after the second cycle amplification. Third, in some cases, real expression intensity on some probe spots are underestimated, because of the saturation of the microarray chip.
Regardless the decaying speed of a mRNA varying by different ribonucleic acids and its secondary structures, we assume that a ribonucleic acid on mRNA degrades linearly relied on the distance to an end, either 3’ end or 5’end. Also, we assume that there exists a degrading limit, behind which degradation seldom takes place. We are clear that mRNA degrades from both 5’ and 3’ ends. In one pathway, mRNA shortening is followed by removal of the 5’ cap structure. Decapping gives a 5’-to-3’ exonuclease access to degrade the remainder of the mRNA . In the other pathway, poly (A) shortening is followed by 3’-to-5’ digestion by a complex of exonucleases named the exosome. The exosome is distinct from the exonuclease that removes the poly (A) tail . But poly (A) shortening is absent in microarray data, for hybridization losing the poly (A) fragments. Thus, degradation of 3’ end in microarray data is slighter and more easily treated. Another reason why we first consider degradation of 3’ end is that 5’ end’s decline attributes not only to decaying, but also to incomplete reverse transcription by random primers. For all 631066 probes of 57381 probe sets of rice microarray, we collected their distance away from 3’ end of transcript (Each probe represented by its middle nucleotide), and plotted the number of probes at the same position to show its distribution (Additional file 1: Figure S1). It’s obvious that most probes are designed to have distances less than 577 bp from 3’ end, and the number decreases between 30 bp and 577 bp. A similar gradient for 235700 probes of 21407 present probe sets (selected by MAS5.0) is shown in Additional file 1: Figure S1. For present probes, we calculated the mean of PM intensities of probes at the same distance (Figure 1). Two lines, respectively, are plotted from the same biological sample, microarray data using one-cycle (Green) and two-cycle (Red) amplification. It is obvious that, with the same trend, the two lines increase before the left of 133 bp from 3’end, and then decrease until 577 bp. No apparent pattern appears after that, because low number of probes at the same position breaks stability, i.e, there are a small number of probes on transcript further than 600 bp from 3’end, which make it to be of no statistical significance. From Figure 1, it’s obvious that variation of mean intensity in the two cycle labeling is much larger and the bias trend is much more serious than the one cycle labeling. Besides, leaf samples using the Affymetrix one-cycle and two-cycle RNA labelling protocols with the Rice microarray in Data Set 2 shows similar degradation trend (See Additional file 1: Figure S2), which demonstrates that this situation is technique specific rather than sample specific.
In conclusion, our Real Time PCR results indicated the correlation between position and intensity of probes in amplification processes: the probe close to the 5’ end would be underestimated because of degradation and random primer effect, while the probe very close to the 3’ end would be also underestimated because of 3’ degradation; the probe located but still had a distance to 3’ end might have the most exact expression measurement of probe set.
To further evaluate the biological meaning of our adjusting model, we created 21 rice microarray samples (Data Set 3,15 rice stamen samples, 3 leaf samples and 3 bud samples) using the Affymetrix two-cycle RNA labeling.
Several works have considered the effect of mRNA degradation in microarray during the process of two-cycle linear amplification as well as the bias it caused , however, almost no or less work has been reported to establish adjusting methods to solve this problem.
In this study, focusing on the process of two-cycle linear amplification and the features of gene expression microarray data, we found two key factors that could bring biases: (i) Reverse transcription by random primers; (ii) RNA degradation in the first and second cycle amplification. We modeled these processes with uniform distributions and established a model to correct the bias. As we performed the adjustment for PM at probe level, one could further apply any kinds of microarray preprocessing methods, such RMA, dchip or PDNN to perform normalization. Note that we could not only apply the model to PM, but also to MM, as the Retaining Rate measures how probable each position of transcript remains after amplification, which has no relationship to binding mechanism. Other preprocessing methods considering MM could also be combined to do further reliable normalization.
To evaluate our model, we applied the adjustment method to microarray data of Data Set 1 and 3. From three aspects: (i) Coefficient of Variation for probes within a probe set decreases significantly, (ii)The clustering diagram shows more reasonable classification for rice stamen samples and (iii) the relationship among them become be much more closer, we could see that our model had obtained relatively more biologically reasonable results.
To demonstrate the necessity of our model for adjusting, we applied another adjusting method that assigning different weight to each position of transcript according to mean PM expression intensity. The main steps are designed as follows: (i) Plot the mean PM intensity for probes at 12 ∼588bp (probes in 98.68% of present probests) of transcripts. (ii) Apply lowess (locally weighted scatterplot smoothing) to fit the data. Then compute loess smoothed values for all points along the curve. Normalize all loess smoothed values to make their mean to be 1. Take the reciprocal of the normalized value at each position as the weight for probes at this position (See Figure S5). (iii) Adjust PM at each position by multiplying the PM intensity by the weight. (iv) Combine with known preprocessing methods (PDNN, or RMA). We call this process of adjusting Curve Adjustment (CA for short). To compare CA with our method, we applied both of them to Data Set 1 and 3. We could see from Figure S6 that, the CV didn’t decrease and the clustering of 15 samples were almost the same as that of none adjustment (Figure S7). Besides, the sample correlation coefficients didn’t raised much after CA (Figure S8). Thus, these results indicate that direct curve adjustment for microarray data is not suitable and our model adjustment is necessary. See Additional file 1 for more details.
Although we could correct for the bias to some extent, there are still some challenges, such as the assumption of the distribution for random primer to be union, which may be different from actual condition. Besides, we applied a simple linear model to simulate RNA degradation and didn’t consider the different degradation rate caused by different nucleotides, such as GC-content, which may play an important role in two-cycle amplification. A new challenge will be the combination of both position of probe and sequence preference to make more accurate correction of bias.
With the development of Next Generation Sequencing, RNA-seq is widely utilized to measure gene expression at transcriptional level at unprecedented precision and throughput [20–24]. Our model may have further applications, as most RNA-seq Library Preparation Protocols also require RNA amplification. In this article, we just provide a basic idea of correcting for bias in microarray raw data to get more accurate result for further analysis, which may shed light on the adjusting methods for RNA-seq data.
Although the approaching of Next Generation Sequencing leads to more accurate results of gene expression, it is now still relatively too expensive. Microarray will still be a feasible way to measure gene expression in the near future. Our method could be applied to reuse the existing microarry data generated with two cycle amplification protocols, and the more biological promising results obtained by the adjusting model will surely benefit the following analysis, e.g. detecting differentially expressed genes or gene set analysis.
This work was supported by the National Natural Science Foundation of China (No.31171262, No.11021463), the National Key Basic Research Project of China (No.2009CB18503).
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.