Datasets used
The dataset comprised expression levels of 491 genes in 33 samples, with emphasis on the 17 samples directly related to grain filling [1]. The complete dataset is available at http://www.blackwell-science.com/products/journals/suppmat/PBI/PBI006/PBI006sm.htm. The 491 genes were selected on the basis of their sequence annotation and functional classification [14], because their products are presumably involved in or associated with three major pathways of nutrient partitioning: the synthesis and transport of fatty acids, carbohydrates, and proteins. The 17 grain filling-related tissue samples include panicle 1–3 cm, panicle 4–7 cm, panicle 8–14 cm, panicle 15–20 cm, seed 0 day, seed 2 day, seed 4 day, seed 7 day, seed 9 day, seed (soft dough), seed (hard dough), embryo, endosperm, seed coat (milk stage), aleurone, and seed (milk stage). A complete description of the experimental protocols used to generate this dataset can be found in Zhu et al. (2003) [1].
Data normalization
In our matrix A', each row corresponded to a different gene and each column corresponded to one of 17 different conditions. The cell $a_{ij}$ in A' was the expression level of gene i under condition j. The data in A' were transformed to the n × m matrix A according to the protocol in Zhu et al. (2003) [1]. During this transformation, values of $a_{ij}$ less than 5 were set equal to 5, and all values were then log2-transformed. Next, the expression vectors were median-centered and normalized such that the sum of squares for each expression vector was equal to one. To validate our results, we also investigated gene expression levels in a wider range of samples, totaling 33 and including the 17 mentioned above. The normalization applied to this broader set was the same as that described above for the set of 17. Note that the difference in sample number affects the median centering and normalization steps, making smaller deviations from the median less obvious.
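The normalization steps described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the original S-PLUS procedure: the function names, the toy input values, and the handling of a completely flat profile (returned as zeros) are assumptions made for the example.

```python
import math
from statistics import median

def normalize_profile(raw, floor=5.0):
    """Floor, log2-transform, median-center, and scale one expression vector."""
    # Values below the floor are set to the floor; all values are then log2-transformed.
    logged = [math.log2(max(x, floor)) for x in raw]
    # Median-center the vector across conditions.
    med = median(logged)
    centered = [x - med for x in logged]
    # Scale so that the sum of squares of the vector equals one.
    norm = math.sqrt(sum(x * x for x in centered))
    if norm == 0.0:
        return centered  # assumption: a flat profile is left as all zeros
    return [x / norm for x in centered]

def normalize_matrix(a_prime, floor=5.0):
    """Apply the per-gene normalization to every row of the raw matrix A'."""
    return [normalize_profile(row, floor) for row in a_prime]

# Toy raw matrix (hypothetical values): two genes, four conditions.
example = [[2.0, 10.0, 40.0, 160.0],
           [5.0, 5.0, 5.0, 5.0]]
A = normalize_matrix(example)
```

Because the median and the scaling factor are computed per gene across all columns, running the same function on 17 or on 33 conditions generally yields different values for the shared samples, which is the effect noted in the text.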
Data decomposition
The SVD theorem (Press et al., 1992) is stated in Equation 1 [15]:

$$A = U W V^{T} \tag{1}$$

U (n × q) and V (q × q) contain orthogonal vectors, and W (q × q) is a diagonal matrix of coefficients, or singular values, denoted $w_1, w_2, \ldots, w_q$. q is the rank of A, and is generally the smaller of the two dimensions n and m. The decomposition was performed using the commercial software package S-PLUS™ (Insightful Co., Seattle, WA) according to Golub and van Loan (1996) [16]. The rows of $V^{T}$, or V transposed, are the right singular vectors, $v_j$. Each right singular vector, alone or in combination with other vectors, describes a pattern of variation in A that could be indicative of a biological process. The columns of U are the left singular vectors, $u_j$. Each coefficient, $u_{ij}$, indicates the relative contribution of pattern $v_j$ to the expression profile of gene i. The singular value $w_i$ indicates the relative contribution of pattern $v_i$ to all gene expression patterns in A. The square of each singular value divided by the sum of the squared singular values defines the relative variance for that singular value. This relative variance indicates how much of the variance in A is explained by a particular singular vector. The expression profile of any gene can be written as a linear combination of these singular vectors and the singular values in W.
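The quantities defined above can be illustrated with a short NumPy sketch. It stands in for the S-PLUS computation and is not the original code; the random 6 × 4 matrix is a hypothetical placeholder for the normalized matrix A, and the variable names are chosen for the example.

```python
import numpy as np

# Toy stand-in for the normalized n x m matrix A (hypothetical data).
rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))

# Thin SVD: U is n x q, w holds the q singular values, and the rows of Vt
# are the right singular vectors v_j (here q = min(n, m) = 4).
U, w, Vt = np.linalg.svd(A, full_matrices=False)

# Relative variance: squared singular value over the sum of squared singular values.
rel_var = w**2 / np.sum(w**2)

# Expression profile of gene i as a linear combination of the right singular
# vectors, weighted by the coefficients u_ij and the singular values w_j.
i = 2
profile_i = sum(U[i, j] * w[j] * Vt[j] for j in range(len(w)))
```

Here `profile_i` reproduces row i of A up to floating-point error, and `rel_var` sums to one, with the first entries corresponding to the dominant patterns.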
Pattern recognition
After A was decomposed, we identified the right singular vectors that matched our preconception of a grain filling pattern of expression, for example, low expression during panicle development and increasing expression during grain development. For each interesting pattern, $v_j$, the genes, $g_i$, were sorted by $u_{ij}$ and the top 80th percentile were selected. These top scorers were compared to 98 genes previously identified as grain filling-related nutrient partitioning genes by Zhu et al., which they used as a template for selecting other genes and transcription factors involved in grain filling. In Zhu et al., the 98 genes were manually selected by visualization of a hierarchical clustering informed by a SOM grouping of the 491 potential nutrient partitioning genes (Figures 5,6). The quality of the ordering given by $u_j$ was assessed by plotting the percent of the 98 found having a percentile greater than p for all p less than 1. Similarly, the percent of those genes selected that are in the set of 98 was plotted for all p.
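The two curves described above amount to the recall and precision of the percentile cutoff with respect to the reference set of 98 genes. A minimal sketch follows, assuming that a gene's percentile is its position in the ascending ordering divided by the number of genes and that genes at or above the cutoff p are selected; the function names, gene identifiers, and scores are hypothetical.

```python
def percentile_ranks(scores):
    """Map each gene to its position in ascending score order, divided by n."""
    n = len(scores)
    ordered = sorted(scores, key=scores.get)
    return {gene: rank / n for rank, gene in enumerate(ordered)}

def recall_precision(scores, reference, p):
    """Return (fraction of reference genes selected, fraction of selected genes in reference)."""
    ranks = percentile_ranks(scores)
    # Assumption: genes with a percentile at or above p are selected.
    selected = {gene for gene, r in ranks.items() if r >= p}
    hits = selected & reference
    recall = len(hits) / len(reference)
    precision = len(hits) / len(selected) if selected else 0.0
    return recall, precision

# Hypothetical coefficients u_ij for one pattern v_j, and a toy reference set.
scores = {"g1": 0.9, "g2": 0.1, "g3": 0.5, "g4": -0.3, "g5": 0.7}
reference = {"g1", "g3"}

# One (recall, precision) pair per cutoff gives the two curves.
curve = [(p, *recall_precision(scores, reference, p)) for p in (0.0, 0.3, 0.5)]
```

Evaluating the function over a fine grid of p between 0 and 1 gives the points for both plots; lowering p trades precision for recall.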
We observed the entropy (E) of various distributions during our study, and the generalized formula we used is shown in Equations 2 and 3 for a vector F, containing N scalars.
Promoter analysis
After genes were classified, their promoter sequences were identified to test whether pattern similarity could be related to conserved cis elements. Statistically significant elements were identified with a Perl script and annotated with the PLACE database [17]. The Perl script identified motifs among the promoter sequences for a given set of genes. Elements that matched an annotated cis-acting regulatory DNA element from the PLACE database were then reported. We limited our investigation to elements that were located within 2 kb of the transcriptional start site and had an e-value less than 3E-02. At the time of publication, not all probe sets could be associated with high-quality assembled upstream sequences.