### Materials and Methods

In this chapter, we first briefly introduced three kinds of microarray data of rice, and then described the designed Real Time PCR experiment to revalidate transcriptional degradation. For brevity, we refer to these three data sets as Data Set 1, Data Set 2 and Data Set 3, respectively. Plant materials as well as sample collection and total RNA isolation method are the same in three data sets. Data Set 1 was designed to show the phenomena of RNA degradation and to estimate parameters in the model. Data Set 2 was designed to validate the reproducibility of the bias from RNA degradation and to determine whether the bias is sample specific. Data Set 3 was applied to validate the efficiency of our adjusting model. After that we established a probabilistic model to estimate the extent of bias caused by RNA degradation during two-cycle linear amplification and attempted to correct for this bias. As mentioned above, the reason why bias is introduced is mainly due to the using of random primers and RNA degradation, on which our model is based. The saturation effect of micoarray chip is beyond our consideration.

#### Plant materials

Rice (O. sativa L. ssp. japonica cv. Zhonghua 15) seeds were soaked in water at 30°*C* for about one week. And then seedlings were transplanted in plastic pots and cultured at 30 ±2°*C* under 11 hours day/13 hours night cycle.

#### Microarray Data Set 1

The rice embryos were obtained from the seeds soaked in water for 2-3 days. The third mature expanded leaves were collected when they had just fully expended. Rice seedlings were harvested when they were 2-3 cm in height. The panicles were collected when they were 0.6-4 cm in length. Total RNA was isolated separately from rice embryos, the third mature leaves, the fifth mature leaves, seedlings and panicles with TRIzol Reagent from Invitrogen Life Technologies.

To obtain microarray data covering almost all genes in rice, total RNA from different samples described above was mixed into a big pool called Pre-amplified mRNA sample (PAM) by almost equal magnitude. Eight subgroups were separated from the purified PAM. The half of them were amplified and labeled by the One-Cycle Eukaryotic Target Labeling Assay from Affymetrix (Santa Clara, CA, USA). The resulting samples were called One-Cycle cRNA samples (OCS). The other half were amplified and labeled by the Two-Cycle Eukaryotic Target Labeling Assay from Affymetrix (Santa Clara, CA, USA). The resulting samples were called Two-Cycle cRNA samples (TCS). All experimental procedures strictly followed instructions specified in the Affymetrix GeneChip Expression Analysis Technical Manual.

#### Microarray Data Set 2

This data set contained microarray data from two different samples, including the third leaf primordium and the third mature leaves of rice. Total RNA was extracted from the two samples. And then each of them was separately amplified by two methods, the One-Cycle Labeling Assay and the Two-Cycle Labeling Assay (Affymetrix, Santa Clara, CA, USA). Two replicates were applied for each Assay. Totally there are 8 hybridization samples for the following microarray experiments.

#### Microarray Data Set 3

This data set included 21 slides of Affymetrix microarray. There are 7 different stages from rice, including the third leaf primordium, the third mature leaves, and stamen samples at stage 2, 3, 4, 5 and 6 (Stages are partitioned by the development of Stamen) [13, 14]. Biological replicates are applied 3 times for each of them. All the 21 mRNA samples are amplified with Two-Cycle Labeling Assay (Affymetrix, Santa Clara, CA, USA).

All amplified RNA samples were hybridized with the Affymetrix GeneChip Rice Genome oligonucleotide arrays, which is widely used for rice gene expression analysis. The.CEL files are available at http://www.math.pku.edu.cn/teachers/dengmh/RiceTCAC/index.html.

### Validation of RNA degradation by Real Time PCR

To validate the degradation trend mentioned above, we developed a strategy by applying Real Time PCR to several transcripts. Real Time PCR were carried out using the same RNA samples as that of Data Set 1, including Pre-amplified mRNA samples (PAM), One-Cycle cRNA samples (OCS), and Two-Cycle cRNA samples (TCS).

In brief, primers were designed for transcripts using Real Time PCR Primer Design tool (http://www.genscript.com/cgi-bin/tools/primerâˆ–_genscript.cgi?op=standard). Each primer was further confirmed by dissociation curve analysis after the PCR reactions. First strand cDNA was synthesized by reverse transcription using 1 *μ*g of total RNA in 20 *μ*l of reaction volume using SuperScriptTM III Reverse Transcriptase from Invitrogen Corporation. Diluted cDNA samples were used for Real Time PCR analysis with 100 nM of each primer mixed with SYBR Green PCR master following manufacturer’s instructions. The reactions were carried out in Optical 96-Well Fast Plate on the 7500 Fast System (Applied Biosystems, USA).

We designed several pairs of primers locating from 5’end to 3’end of a transcript. The target of each pair of primers was called amplicon. Then we used the pairs of primers to carry out Real-time PCR with PAM, OCS and TCS separately. The Schematic diagram of Real Time PCR experiments is shown in Additional file 1: Figure S3. In order to remove the amplification efficiency of different amplicons, expression value of amplicon in PAM was served as reference. We calculated the relative expression values of the ith amplicon in jth sample *R*_{
i
j
} (see Equation 1).

{R}_{\mathit{\text{ij}}}={2}^{-(C{T}_{(i,j)}-C{T}_{(i,\mathit{\text{PAM}})})}\phantom{\rule{2em}{0ex}}j=\{\mathit{\text{OCS}},\mathit{\text{TCS}}\}

(1)

Where CT (cycle threshold) is the number of cycles required for fluorescent signal to reach the threshold. The larger CT is, the less the amount of RNA is. For a fixed j, among all the amplicons of a transcript, amplicon with the maximum relative expression value (*M* *A* *X* (*R*_{
i
j
})) located in the region with minimal impact by degradation and amplification. And *M* *A* *X*(*R*_{
i
j
}) was closest to the actual expression level of the transcript. So we calculated the *D* *P*_{
i
j
} of each amplicon relative to the *M* *A* *X* (*R*_{
i
j
}) to make clear the impact of degradation and amplification on different regions of transcript:

D{P}_{\mathit{\text{ij}}}=1-\frac{{R}_{\mathit{\text{ij}}}}{\mathit{\text{MAX}}\left({R}_{\mathit{\text{ij}}}\right)}\phantom{\rule{2em}{0ex}}j=\{\mathit{\text{OCS}},\mathit{\text{TCS}}\}

(2)

*D* *P*_{
i
j
} is a number between 0 and 1, where 0 means no degradation effect and 1 indicates the amplicon is completely degraded. The closer *D* *P*_{
i
j
} is to 0, the less the region of ith amplicon is affected by amplification in jth sample, and vice versa.

### Model Adjustment for RNA amplification

According to the two-cycle linear amplification protocol, for each RNA transcript, it might have been shortened three times from the beginning to the end of two-cycle amplification (See Figure 2, details see manual in [7]), which are: (i) RNA degradation during the IVT of first cycle amplification; (ii) Introducing of random primer in the second cycle; and (iii) RNA degradation during the IVT of second cycle amplification.

To better understand the model, we introduce the notation. To simplify calculation, we first assume that RNA is degrading at an even speed, regardless the sequence effect, e.g. GC content. Let 0 and *L* (base pair, bp) be the positions of 3’ end and 5’ end of a transcript before amplification, where *L* is the length of the transcript. Let *A*_{
i
} and *B*_{
i
} be the position of the new 3’ and 5’ end after the *ith* shorten (*i* = 1,2,3). It’s obvious that *A*_{0} = 0, *B*_{0} = *L*. (i) After the first cycle amplification, based on the uniform assumption of RNA degradation, we further assume *A*_{1} ~ *U*[*A*_{0},*A*_{0} + *a*] and *B*_{1} ~ *U*[*B*_{0} − *b*,*B*_{0}], where *a*,*b* are the degradation limits of 3’ end and 5’ end separately. (ii) In the second cycle of amplification, 3’ and 5’ ends varies more complicated. As random primer is used in the step of the first strand cDNA synthesis from antisense RNA produced in the first cycle of amplification, which may result in incomplete synthesis. 5’ end is decreased while 3’ end remains the same. Therefore, the position of new 3’ end *A*_{2} is the same as *A*_{1}, while the position of new 5’ end *B*_{2} ~ *U*[*A*_{1},*B*_{1}], as random primer can bind to any position of the cRNA. (iii) As the same as in the first cycle, degradation takes place at both 3’ end and 5’ end in the IVT steps of the second cycle amplification. So it’s similarly concluded that *A*_{3} ~ *U*[*A*_{2},*A*_{2} + *a*] and *B*_{3} ~ *U*[*B*_{2} − *b*,*B*_{2}], where *a*,*b* are the same as in the first cycle.

As most RNA transcripts are more than 200 bp, we utilized integral to substitute sum for calculation. Then we could obtain the joint distribution of (*A*_{
i
},*B*_{
i
}),*i* = 1,2,3.

i) The joint distribution of *A*_{1} and *B*_{1} is:

\begin{array}{ll}{F}_{1}(x,y)& =P({A}_{1}\le x,{B}_{1}\le y)\\ =\frac{x(y+b-L)}{\mathit{\text{ab}}},(x\in [0,a],y\in [L-b,L\left]\right)\end{array}

(3)

ii) The joint distribution of *A*_{2} and *B*_{2} is:

\begin{array}{ll}{F}_{2}(x,y)& =P({A}_{2}\le x,{B}_{2}\le y)\phantom{\rule{2em}{0ex}}\\ =\underset{0}{\overset{a}{\int}}d{x}^{{}^{\prime}}\underset{L-b}{\overset{y}{\int}}P({A}_{2}\le x,{B}_{2}\le y|{A}_{1}\phantom{\rule{2em}{0ex}}\\ ={x}^{{}^{\prime}},{B}_{1}={y}^{{}^{\prime}}\left){p}_{1}\right({x}^{{}^{\prime}},{y}^{{}^{\prime}})d{y}^{{}^{\prime}}\phantom{\rule{2em}{0ex}}\end{array}

(4)

where

{p}_{1}({x}^{{}^{\prime}},{y}^{{}^{\prime}})=\frac{{\partial}^{2}{F}_{1}({x}^{{}^{\prime}},{y}^{{}^{\prime}})}{\partial {x}^{{}^{\prime}}\partial {y}^{{}^{\prime}}},

(5)

\begin{array}{ll}P& ({A}_{2}\le x,{B}_{2}\le y|{A}_{1}={x}^{{}^{\prime}},{B}_{1}={y}^{{}^{\prime}})\\ \phantom{\rule{1.5em}{0ex}}=\left\{\begin{array}{l}\frac{y-x}{{y}^{{}^{\prime}}-x},\phantom{\rule{1em}{0ex}}\text{for}\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}x={x}^{{}^{\prime}}\\ 0,\phantom{\rule{2em}{0ex}}\phantom{\rule{1em}{0ex}}\phantom{\rule{0.3em}{0ex}}\text{for}\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}x\ne {x}^{{}^{\prime}}\end{array}\right.\end{array}

iii).The joint distribution of *A*_{3} and *B*_{3} is:

\begin{array}{ll}{F}_{3}(x,y)& =P({A}_{3}\le x,{B}_{3}\le y)\phantom{\rule{2em}{0ex}}\\ =\underset{x}{\overset{x+a}{\int}}d{x}^{{}^{\prime}}\underset{y-b}{\overset{y}{\int}}P({A}_{3}\le x,{B}_{3}\le y|{A}_{2}\phantom{\rule{2em}{0ex}}\\ ={x}^{{}^{\prime}},{B}_{2}={y}^{{}^{\prime}}\left){p}_{2}\right({x}^{{}^{\prime}},{y}^{{}^{\prime}})d{y}^{{}^{\prime}}\phantom{\rule{2em}{0ex}}\end{array}

(6)

where

\begin{array}{l}{p}_{2}({x}^{{}^{\prime}},{y}^{{}^{\prime}})=\frac{{\partial}^{2}{F}_{2}({x}^{{}^{\prime}},{y}^{{}^{\prime}})}{\partial {x}^{{}^{\prime}}\partial {y}^{{}^{\prime}}},\end{array}

\begin{array}{l}P({A}_{3}\le x,{B}_{3}\le y|{A}_{2}={x}^{{}^{\prime}},{B}_{2}={y}^{{}^{\prime}})\phantom{\rule{2em}{0ex}}\\ \phantom{\rule{1.5em}{0ex}}=\frac{(x-{x}^{{}^{\prime}})(y+b-{y}^{{}^{\prime}})}{\mathit{\text{ab}}}\phantom{\rule{2em}{0ex}}\end{array}

(7)

Thus *F*_{3}(*x*,*y*) can be rewritten out piecewise (shown in Additional file 1: Formula F1). Then, for each transcript’s products, we could estimate a Retaining Rate function *p*_{
i
}(*z*), *i* = 1,2, which states a probability indicates how probable the the nucleotide on the position of *z* (bp) retains after the incomplete synthesis in the *ith* cycle amplification. Here, *z*, in [0,*L*], represents the distance of a nucleotide on the transcript away from its 3’ end. After the first cycle amplification, the Retaining Rate function *p*_{1}(*z*), can be easily calculated from the distribution of 3’ and 5’ end:

{p}_{1}\left(z\right)=\left\{\begin{array}{ll}\frac{z}{a},& \phantom{\rule{1em}{0ex}}\text{for}\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}0\le z\le a\\ 1,& \phantom{\rule{1em}{0ex}}\text{for}\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}a<z<L-b\\ \frac{L-z}{b},& \phantom{\rule{1em}{0ex}}\text{for}\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}L-b\le z\le L\end{array}\right.

(8)

But for the second cycle amplification, the Retaining Rate function *p*_{2}(*z*) is complicated piecewise, associated with *F*_{3}(*x*,*y*). We calculated it by integrating with R software.

### Adjustment of biased signal

For each transcript, there is a probe set containing proximately 11 probes (99.58% of 57381 probe sets for rice microarray) on microarray to detect its the expression signal. We first define a Retaining Rate for a position on transcript, which is a probability indicates how probable the position of the probe on transcript remains after two-cycle amplification. Before the first step of two-cycle amplification, the Retaining Rate for each probe is 1, when amplification is over, the retaining rate for probe at *z* is *p*_{2}(*z*) (Note that *z* refers to distance of the middle nucleotide of probe from 3’ end of the transcript). For the *kth* poly(A) RNA, the ideal intensity of *ith* probe after *jth* cycle, j = 1,2, is *I* *P* *S*_{
j
k
}; and the observed intensity (PM intensity) for *ith* probe at position *z*_{
i
k
} is *I* *P*_{
i
j
k
}. Then we can get *I* *P*_{
i
j
k
} as a function of *I* *P* *S*_{
j
k
}:

I{P}_{\mathit{\text{ijk}}}=\mathit{\text{IP}}{S}_{\mathit{\text{jk}}}\times {P}_{j}\left({z}_{\mathit{\text{ik}}}\right)+{\epsilon}_{\mathit{\text{ijk}}}

(9)

For determined *j*, *k* and all *i*, {\epsilon}_{\mathit{\text{ijk}}}\sim N(0,{\sigma}_{\mathit{\text{jk}}}^{2}), where {\sigma}_{\mathit{\text{jk}}}^{2} is unknown. Then least square estimation becomes maximum likelihood estimation.

Thus, the maximum likelihood estimation for *IPS*_{
j
k
} is:

\mathit{\text{IP}}{S}_{\mathit{\text{jk}}}=\frac{\sum _{i}(I{P}_{\mathit{\text{ijk}}}\times {p}_{j}({z}_{\mathit{\text{ik}}}\left)\right)}{\sum _{i}{p}_{j}^{2}\left({z}_{\mathit{\text{ik}}}\right)}

(10)

where *i* = 1,2,…*n*_{
j
}, *n*_{
j
} is the number of probes in the *jth* probe set.

Although the above model only uses PM of probe for training the parameters, it can also be used to adjust MM intensity, as Retaining Rate represents the probability of the nucleotide of each position on transcript to be remained after two-cycle amplification. As our model only correct for the bias caused by degradation during amplification at probe level, the corrected intensity didn’t perform the background correction or normalization at probe set level. We think it necessary to perform a further pre-processing to adjust for bias introduced by cross-hybridization or other factors. After adjusting for PM and MM, we could apply many existing popular normalization methods, such as PDNN [15], dchip [16] and RMA [17] to preprocess the modified microarray data before further investigation. In this study, we choose PDNN as it is more powerful by considering the sequence binding information [18, 19].