Systematic interpretation of microarray data using experiment annotations

Background Up to now, microarray data are mostly assessed in context with only one or few parameters characterizing the experimental conditions under study. More explicit experiment annotations, however, are highly useful for interpreting microarray data, when available in a statistically accessible format. Results We provide means to preprocess these additional data, and to extract relevant traits corresponding to the transcription patterns under study. We found correspondence analysis particularly well-suited for mapping such extracted traits. It visualizes associations both among and between the traits, the hereby annotated experiments, and the genes, revealing how they are all interrelated. Here, we apply our methods to the systematic interpretation of radioactive (single channel) and two-channel data, stemming from model organisms such as yeast and drosophila up to complex human cancer samples. Inclusion of technical parameters allows for identification of artifacts and flaws in experimental design. Conclusion Biological and clinical traits can act as landmarks in transcription space, systematically mapping the variance of large datasets from the predominant changes down toward intricate details.

Pitfalls in experimental design (fly data). A high-ranking technical parameter should be regarded as a warning sign rather than a death sentence for further analysis of the dataset, though. In the next example, the variance captured by array series may comprise a considerable share of biological variance due to a flaw in the experimental design. The data summarized in Table 2 draw a picture of the entire drosophila life cycle. They are described in detail in ref. [9].
Out of 120 experiment annotations, only 7 take different values throughout the dataset. These data stem from the more common glass-microarrays involving fluorescent labelling. Here, we would expect much less variance corresponding to different array batches because ratios reflecting the competition of two differently labelled transcripts should be less dependent on the absolute amount of binding sites, i.e. on the amount of DNA spotted. Contrary to this expectation, the array batch shows up on rank 4. However, when these hybridizations were performed, array batches were not evenly distributed over the development of the fly. Array series 2 was only used for embryonic stages, for example. Biological variance acts as a confounder and might well make up a considerable amount of the variance captured by array series.
A second flaw in the experimental design becomes visible in the rank of the annotation array individual. This variable offers the opportunity to track fabrication faults such as pins sticking in an upper position in the pin tool not touching the surface for a limited number of consecutively produced chips. The numbering of individual arrays follows the chronological order of their fabrication. But since they were largely used in the same sequence, it also reflects the chronological order of the hybridization experiments. Thus consecutive blocks of array individuals may as well correspond to experiment clusters representing particular experimental conditions rather than to a fabrication fault. Arrays 3 to 6 correspond to very similar pupal stages two and three, just to name one example (data not shown).
Thus, biological variance acts as a confounder for technical annotations, indicating flaws in experimental design rather than an artifact. Although it cannot be shown in this data example, we generally assume that the array batches have little influence on two-channel data. The remaining annotations have been investigated by CA. They draw a sound and revealing picture of the drosophila embryonic development (data not shown). In order to eliminate the confounding influence of the hybridization order, the arrays should be randomly permuted before use.

Overview (cancer data)
The pancreas cancer data comprise 87 hybridizations of 20 pancreas carcinoma and 10 normal tissue samples stemming from 30 patients. The chip comprises 3559 features (in double-spotting) with the main focus on cancer-relevant genes [10]. The annotation explaining most variance of the transcription data is the tumor type (Table 3). As for the previous data set, these data show a non-random distribution of array production batches among the experimental conditions, albeit to a lesser extent. Ranking the annotations by contributed variance (inertia), array series shows up at position 10. It shows a negative SV, though. The variance is due to an unbalanced distribution of the array batches throughout the tumor types. Ductal adenocarcinoma have been exclusively hybridized on array batch 8 just to give one example. Excluding such unique tumor types as well as unbalanced array batches shifts array series to a rank at the lower part of the table (Table 6), corroborating our assumption that direct comparisons across array batches are possible with two-channel data.
Thus, we further investigated the pancreas samples. They have been summarized by reducing to only four trait-clusters (Fig. 2). The clusters are numbered by increasing malignancy. Figure 3 arranges the cluster-centroids from right to left, showing associated genes.
Benign. Many genes associated to cluster 1 (red) and 2 (blue) are indicative of normal, differentiated and functional pancreatic tissue: Pancreatic lipase (PNLIP) is a typical enzyme secreted by normal pancreatic exocrine cells in order to digest nutritional triglycerides. Carboxypeptidase A1 (CPA1) is a pancreatic exopeptidase. The sequence of PRSS2 is similar to trypsinogen IVa precursor mRNA. Trypsinogen is a typical pro-enzyme secreted by normal pancreatic exocrine cells in order to digest nutritional proteins. Pancreatic polypeptide (PPY) is a pancreas specific hormone which is involved in the regulation of exocrine pancreas secretion and biliary tract mobility [11]. The Myeloid cell leukemia sequence 1 (MCL1) was found to be expressed in normal fetal and embryonic pancreatic tissues, has been suggested to be involved in the control of proliferation and differentiation of normal pancreatic cells [12], and appears to play a role in control of pancreatic islet cell growth [13]. Taken together, these genes are typically expressed by normal pancreatic cells and encode proteins required for food digestion.
Mucinous. Exclusively associated to cluster three (magenta, mucinous tumors) are the connective tissue growth factor (CTGF) which will be discussed in context of present alcohol consumption as well as the glutathione peroxidase 3 (GPX3) which was reported to be overexpressed in ovarian cancer [14]. CTGF is expressed as a 2.4-kb mRNA in a broad spectrum of human tissues. Sequence comparison revealed that CTGF belongs to a group known as the immediate-early genes, which are expressed after induction by growth factors or certain oncogenes [15]. Connective tissue growth factor is involved in pancreatic repair and tissue remodeling in acute necrotizing pancreatitis [16].
Mucinous and ductal. On the left side of the plot we find genes associated both to cluster 3 and 4: Pancreatic cancer shows a strong desmoplastic reaction characterized by a remarkable proliferation of interstitial connective tissue (collagen, fibronectin (FN1)) [17,18,19]. The tyrosine kinase receptor (Tie-1) was shown to be upregulated in (and can serve as a prognostic marker for) various metastatic malignancies incl. leukemia, breast and gastric cancer [20,21,22,23,24,25,26,27]. However, to our knowledge, we are the first to report it in the context of pancreas carcinoma. The interferon-alpha induced 11.5 kDa protein (IFI27) was suggested to be a novel marker of epithelial proliferation and cancer [28] and will be discussed later on in the context of past alcohol consumption. The nonspecific cross-reacting antigen (NCA) is already associated more to cluster four (adenocarcinoma), confirming results reported earlier [29].
Ductal. Exclusively associated to the fourth cluster (green, highest malignancy) are the fibroblast growth factor 2 (FGF2, consistent with [30]), the Clostridium perfringens enterotoxin receptor (CPE-R) which was discussed in context of prostate cancer [31], and Glutathione [32]. Also applicable to discriminating the highly aggressive tumors from the mucinous are mucin 1 (MUC1), which is in agreement with [33], as well as two more genes described as follows. Increasing evidence has accumulated in support of the hypothesis that growth hormone (GH) and insulin-like growth factors (IGFs) play a role in carcinogenesis. Insulin like growth factor binding protein 3 (IGFBP3) is upregulated in pancreatic endocrine tumors and its overexpression is significantly more common in metastatic disease [34]. High expression of IGFBP3 has been associated with invasiveness and poor prognosis in other cancer types [35]. GH receptor antagonist treatment decreased colon carcinoma growth in nude mice, associated with reduction in circulating IGFBP3 levels [36]. Elongation factor 1γ (EF1γ) is overexpressed in esophageal cancer with severe lymph node metastasis and far advanced stages of the disease compared with non-overexpressing cases [37]. In summary, genes affiliated to cluster four are known to be associated with metastasis, advanced stage disease and poor prognosis of pancreatic and other cancers.

Cluster four (cancer data)
To demonstrate how the analysis proceeds towards more detail, we now analyze cluster four alone. It comprises six annotation values that are combined to four (Fig. 6) gaining perfect projection within two dimensions (Fig. 7, upper right). Figure 7 shows the variance within cluster four. All 22 measurements annotated by traits of cluster four (black boxes) are related to tumors classified as either ductal adenocarcinoma or other (i.e. none of the more benign tumor types). For 16 measurements, patients died within one year after surgery.
Pink cluster. The upper left corner holds six measurements annotated with no other than these traits along with genes associated with normal pancreatic function (pancreatic polypeptide PPY and Glycogen synthase) or adhesion (laminin alpha 5, LAMA5) along with all other measurements not annotated with any trait of cluster four (grey boxes).
Blue cluster. All other 16 measurements annotated by cluster four are metastatic (tumor site=kidney). 12 of them stem from patients who received postoperative chemotherapy. They split up into 8 measurements annotated with both past alcohol consumption and chemotherapy preceding surgery (red) and 8 others classified pN stage 1 as well as WHO stage III (green). Common to both groups is an overexpression of mucin 1 (MUC1, [33]), GW112 (reported in context of gastric cancer metastasis [38]), and the S100 calcium binding protein (S100Ca+) which was shown to promote invasiveness of pancreatic cancer [39].
Red cluster. Chemokine (CXC motif) receptor 4 (CXCR4) has been linked to metastasis, male gender and older ages in colorectal cancer [40] and has been reported in context of invasiveness of pancreas cancer before. Our data show it associated to past alcohol consumption as will be detailed later under Alcohol Consumption.
Green cluster. Another kind of ductal adenocarcinoma involving lymph node invasion (pN stage=1) is associated to upregulation of Paxillin (PXN) and Fas-activated serine/threonine kinase (FASTK), instead. While PXN has been linked to cancer cell migration [41], FASTK might correspond to a gain of chromosome 7 as reported in context of radiation resistance in glioblastoma multiforme [42].
In summary, trait cluster four annotates six non-metastatic and sixteen metastatic measurements. The latter fall into two transcriptional types, presumably due to cytogenetic rearrangements. From this step (Figs. 6 and 7), the analysis would continue by assessing the difference between the two annotation values shown in pink in Fig. 6 in another step. A last step would be to account for the difference between the two blue traits, completing the analysis of the most malignant quarter of the pancreas data's variance.

Alcohol consumption (cancer data)
Knowing from Table 3 which parameters correlate to transcription, one can select one or several of special interest. Figure 4 projects the values taken by experiment annotation 'alcohol consumption' (Table 4). It explains 92.8% of their variance. The abscissa (first principal axis) largely projects the difference between the pooled normal tissue samples as a reference and alcohol intake of cancer patients. Because the reference samples have all been annotated as pooled, instead of discriminating between present, past and no alcohol consumption also for healthy individuals, the above difference is trivial: it also includes the difference between normal and cancer tissue under a different name.
Thus, on the right side of Fig. 4 there are tagged genes already discussed for the overview such as PRSS2, PNLIP, and PPY, which are typically expressed by normal pancreatic cells for food digestion and which are downregulated upon alcohol consumption (both past and present).
Past and present alcohol consumption. The following genes are upregulated with both past and present alcohol consumption: Fibronectin (FN1), and collagens Type I (COL1A2) and III (COL3A1) have already been discussed above in context of the strong desmoplastic reaction of pancreatic cancer. Matrix Metalloproteinase 2 (MMP2) has been found to be expressed in pancreatic cancers and has been positively correlated with metastasis [43,44]. Furthermore, MMP2 has been found to be a diagnostic marker for pancreatic carcinoma in pancreatic juice [45].
In summary, the geneset negatively or positively associated to alcohol consumption in general characterizes healthy pancreatic tissue on one hand and the dense connective tissue reaction of pancreatic cancer involving fibronectin and collagens type I and III on the other.
Significant difference. In the following, we attempt to discriminate between present and past alcohol consumption among the cancer patients. The ordinate (second principal axis) explains 19.8% of the total variance, almost exclusively corresponding to the variance between past and present alcohol consumption. But is it also significant? We performed a significance analysis of microarrays (SAM, [46]). A dataset including all the genes (3559) and one gene-wise median per patient yielded (∆=0.72) 1082 significant genes with an estimated false discovery rate (FDR) of 4.5%. In rare cases, the technical variance of the measurement process may lower the number of significant genes. Here, however, including all hybridizations (double-spots averaged) instead of one median per patient yields more significant genes (1495 with 3.8% FDR at ∆=0.8), as may be expected (for reasonably reproducible measurements).
Past alcohol consumption. The two categories have been further characterized in terms of prior knowledge about single differential genes. The following genes are exclusively associated to past alcohol consumption, all previously reported in context of carcinogenesis or chronic pancreas damage: IFI27, also associated to cluster 3 and 4 in Fig. 3, is an interferon alpha-inducible protein. It has very recently been found to be upregulated in epidermal skin lesions, during wound repair in proliferating cells as well as in cutaneous squamous cell cancers and thus, it was suggested that IFI27 is a novel marker of epithelial proliferation and cancer [28]. The gene was found to be overexpressed in colon samples from patients with inflammatory bowel disease compared with normal colon samples [47] and in hepatocellular carcinoma samples [48]. Thus, increased expression of IFI27 may indicate chronic inflammation, regeneration and tissue repair of the pancreas in individuals with past alcohol consumption. All these physiological processes predispose to cancer development. S100P is the calcium binding protein of protein family 100. S100 proteins are localized in the cytoplasm and/or nucleus of a wide range of cells, and involved in the regulation of a number of cellular processes such as cell cycle progression and differentiation. S100P has been found to stimulate cell proliferation and survival in an autocrine manner in cells [49]. Furthermore, it is implicated in pancreatic tumorigenesis and metastasis [50]. Members of the family of S100 proteins have been found to be expressed in various metastatic adenocarcinomas [51] and poorly differentiated carcinomas [52]. Our data show that alcohol consumption over a longer period is associated with expression of S100P in pancreatic cells, which may promote malignant transformation by chronic stimulation of cell proliferation and survival, making cells more prone to additional mutagenic events.
Keratins represent a family of more than 20 different polypeptides which are important markers of epithelial cell differentiation. Precursor cells in different tissues display high Keratin 19 (K19) levels, and upon differentiation, K19 expression becomes epithelial cell-specific. Many premalignant and malignant tissues display K19 expression, such as dysplasia and carcinoma of squamous epithelia and adenocarcinoma of the lung, breast, pancreas, stomach and colon [53]. Exocrine acinar cells and endocrine islet cells are well-differentiated cells which express the keratin combination 8 and 18, whereas the less-differentiated cells of the ductal tree are characterized by the additional expression of keratin 7 and keratin 19 [54]. In the developing pancreas, duct-like precursor cells harbor high K19 expression [53]. Cytokeratin 19 expression has been described as expression marker in pancreatic carcinomas [55]. Higher expression of cytokeratin 19 was observed in human pancreatic epithelia in early stages of development (14 weeks of gestation) compared with adult tissues [56]. Thus, past alcohol consumption is associated with expression of a marker gene for undifferentiated, immature stem-cell like pancreatic cells. This may reflect chronic damage of pancreatic cells with subsequent regenerative activity of the organ which in turn can render pancreatic cells more prone to additional genetic mutational changes.
Also associated to past alcohol consumption is the chemokine receptor 4 (CXCR4). Tumour cell migration and metastasis share many similarities with leukocyte trafficking, which is critically regulated by chemokines and their receptors. The chemokine receptor 4 is highly expressed in primary and metastatic human breast cancer cells but is undetectable in normal mammary tissue and has been implicated in breast cancer metastasis [57]. Similarly, CXCR4 has been implicated in thyroid carcinoma invasion and tumor cell migration [58]. In pancreatic carcinoma, the CXCR4 receptor ligand system may play a role in the pancreatic cancer progression through tumor cell migration and invasion [59,60]. Thus, past alcohol consumption is associated with expression of a receptor system that has been implicated in pancreatic cancer cell migration, invasion and metastasis.
In summary, genes upregulated with past alcohol consumption have been linked to physiological processes associated with increased risk for malignant transformation and genes involved in pancreatic cancer cell proliferation, survival, invasion, metastasis, and impaired cell differentiation.
Present alcohol consumption. Following five genes associated to present alcohol consumption are tagged in Figure 4: The connective tissue growth factor (CTGF) is expressed as a 2.4-kb mRNA in a broad spectrum of human tissues. Sequence comparison revealed that CTGF belongs to a group known as the immediate-early genes, which are expressed after induction by growth factors or certain oncogenes [15]. Connective tissue growth factor is involved in pancreatic repair and tissue remodeling in acute necrotizing pancreatitis [16].
The vascular endothelial growth factor (VEGF) is a mitogen primarily for vascular endothelial cells. It is upregulated in many cancer tissues and plays a central role in tumor angiogenesis [61]. VEGF-expression is induced in endothelial cells after ethanol exposure in vitro [62] and in gastric mucosa in vivo [63]. Increased VEGF-expression was found upon moderate ethanol consumption in rat skeletal muscle [64]. Thus, these reports support our finding that present alcohol consumption induces VEGF-expression in pancreatic tissue and may promote tumor-angiogenesis.
The tissue inhibitor of metalloproteinase 3 (TIMP3) is an inhibitor of matrix metalloproteinases and an important regulator of inflammatory responses [65]. Increased metalloproteinase-activities are found in experimental and acute pancreatitis [66,67]. Thus, increased expression of TIMP3 in individuals with present alcohol consumption may indicate the acute response of the pancreas to cope with ethanol induced tissue damage of the organ.
In rats, ethanol intake leads to increased secretion of tissue inhibitor of metalloproteinase 2 (TIMP2) by pancreatic stellate cells (PSCs) which are thought to play a central role in pancreatic extracellular matrix formation and fibrogenesis [68]. Imbalance of expression of matrix metalloproteinases (MMPs) and their inhibitors including TIMP2 was described in human pancreatic carcinoma [69]. In our findings, the gene is less specific for present alcohol consumption. We mention it because of the direct experimental evidence for being linked to ethanol intake.
We also observe the interferon induced transmembrane protein 1 (IFITM1) as overexpressed specifically with present alcohol consumption. Interferon expression is increased in patients with acute pancreatitis compared with chronic pancreatitis [70], corroborating our assumption that we can discriminate between acute and past alcohol consumption in the context of pancreas cancer development.
Nine more genes of similar specificity, i.e. showing at least higher abundance with present than with past alcohol consumption, are encircled in grey in Figure 4: Expression of ETS2 (v-ets erythroblastosis virus E26 oncogene homolog 2 (avian)), a proto-oncogene and transcription factor, occurs in a variety of cell types. Translocation in acute myeloid leukemia M2 involves the ETS2 gene [71].
Thrombospondin I (THBS1) is a multimodular secreted protein that associates with the extracellular matrix and possesses a variety of biological functions, including a potent angiogenic activity. Thrombospondin 1 has been implicated in tumor invasion, angiogenesis and metastasis of pancreatic carcinoma [72] and is a predictor of ductal pancreatic carcinoma [73].
Thymidylate synthetase (TYMS) is involved in maintaining the dTMP (thymidine-5-prime monophosphate) pool critical for DNA replication and repair. The enzyme has been of interest as a target for cancer chemotherapeutic agents. It is considered to be the primary site of action for 5-fluorouracil, 5-fluoro-2-prime-deoxyuridine, and some folate analogs. TYMS expression is a marker of poor prognosis in resected pancreatic cancer. Patients with high intratumoral TYMS expression benefit from adjuvant therapy with 5-fluorouracil [74].
For the protein NM23B expressed in non-metastatic cells 2, abundance in human pancreatic cancer is positively associated with lymph node metastasis, perineural invasion and poor prognosis [75].
The corticotropin-releasing hormone system regulates the mammalian stress response by coordinating the activity of the hypothalamic-pituitary-adrenal axis. The corticotropin releasing hormone binding protein (CRHBP) is an important element in the CRH system.
The early growth response 2 (EGR2) protein induces apoptosis in various cancer cell lines [76] and may be involved in the tumor growth suppressing effect of the PTEN pathway [77]. To our knowledge we are the first to report it as overexpressed with alcohol consumption or in pancreas carcinoma.
The polo-like kinase 1 (PLK1) encodes a protein serine/threonine kinase which plays a role in cell cycle regulation. Expression of PLK1 promotes mitosis (cell proliferation) and cells transformed with PLK1 grow in soft agar and produce tumors in nude mice. Therefore PLK may be involved in the promotion or progression of cancers [78]. Elevated expression of PLK1 occurs in many different types of cancer, and PLK1 has been proposed as a diagnostic marker for several tumors. siRNA-mediated knock-down of PLK1 dramatically inhibited cell proliferation, decreased viability, and resulted in cell-cycle arrest and apoptosis [79]. Very recently, Plk1 mRNA was found to be overexpressed in pancreatic cancer cell lines and in human tumors. Depletion of Plk1 in pancreatic cancer cells by the use of antisense oligonucleotides induced cell cycle arrest in G2-M as well as a drastic reduction in proliferation rates. It was suggested that Plk1 is a potential therapeutic target for the treatment of pancreatic cancer [80].
The oncogene RHOA (ras homolog gene family, member A) was found frequently overexpressed in gastric cancer tissues and cells compared with normal tissues or gastric epithelial cells. Both RhoA-specific siRNA and dominant-negative RhoA expressions could significantly inhibit the proliferation and tumorigenicity of AGS cells [81]. RhoA overexpression has been linked to cancer cell detachment and metastasis [82] and to progression of testicular cancer [83].
The dual specificity phosphatase 1 (DUSP1) is highly induced by environmental stress [84]. It has dual specificity for tyrosine and threonine and specifically inactivates mitogen-activated protein kinase in vitro [85].
In summary, present alcohol consumption is associated with expression of genes that control cell proliferation and transformation of pancreatic cells and DNA synthesis, that encode oncogenes, and that are linked to acute inflammatory and stress responses, tumor angiogenesis, pancreas carcinoma metastasis, pancreatic repair and tissue remodeling, and pancreatic extracellular matrix composition.
In contrast to past alcohol consumption, present alcohol intake is associated with expression of immediate response genes to tissue damage, repair and remodeling, inflammatory and stress response (IFITM1, CRHBP, TIMP 2 and 3, DUSP1, CTGF). Additionally, correspondence analysis also identified genes that are involved in control of cell proliferation, oncogenesis and tumor angiogenesis to be upregulated in present alcohol intake.
The raw intensity values have been normalized based on a robust affine-linear regression of one measurement versus a control measurement. The method is described in refs. [86] and [87] and performs better than or equally to lowess normalization [88]. For the monochannel (yeast) data, each measurement was normalized versus the genewise median of the 75 hybridizations of the control condition, resulting in absolute intensities. For the two-channel data (fly, human cancer), the channel belonging to the control condition served to normalize the other channel of the same hybridization, resulting in intensity ratios [8].
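As an illustration of what a robust affine-linear normalization of one measurement versus a control can look like, the following minimal sketch fits measurement ≈ slope × control + intercept and then removes the fitted offset and scale. It is not the procedure of refs. [86] and [87], whose details are not given here: the Theil-Sen estimator (median of pairwise slopes) is swapped in as a generic robust fit, and the names theil_sen_fit and normalize are hypothetical.

```python
from itertools import combinations
from statistics import median


def theil_sen_fit(control, measurement):
    """Robust fit measurement ~ slope * control + intercept.

    Assumption: Theil-Sen is used here as a stand-in robust estimator;
    it is not necessarily the regression used in the paper.
    """
    pairs = list(zip(control, measurement))
    # Median of all pairwise slopes is insensitive to a minority of outliers.
    slopes = [(y2 - y1) / (x2 - x1)
              for (x1, y1), (x2, y2) in combinations(pairs, 2)
              if x2 != x1]
    slope = median(slopes)
    # Median residual offset gives a robust intercept.
    intercept = median(y - slope * x for x, y in pairs)
    return slope, intercept


def normalize(control, measurement):
    """Map the measurement onto the scale of the control."""
    slope, intercept = theil_sen_fit(control, measurement)
    return [(y - intercept) / slope for y in measurement]


control = [1.0, 2.0, 3.0, 4.0, 5.0]
measurement = [5.0, 7.0, 9.0, 11.0, 100.0]  # 2 * x + 3 with one outlier
slope, intercept = theil_sen_fit(control, measurement)
normalized = normalize(control, measurement)
```

The pairwise-slope computation grows quadratically with the number of genes, so for a full array one would fit on a random subset of spots; that choice is again an assumption of this sketch.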
Subsequently, genes have been filtered for considerable absolute intensity level in at least one of the conditions (i); reproducibility in the separation from the control condition in at least one of the other conditions (ii) [86,87,8]; and lack of saturation (iii). To compute intensity levels (i) from multichannel ratios, these ratios have been multiplied with the genewise median of the absolute values of the control channels [8]. The following thresholds have been applied: For the yeast data, the condition-median had to be at least 1.5 × 10^4 in at least one condition (i). Furthermore, we asked for complete separation (minmax separation, [86]) of control and non-control measurements in at least one condition (ii). 924 out of 6103 genes satisfied both criteria.
For the fly data, we filtered genes showing condition-median intensity greater than or equal to 10^5 in at least one condition (i), as well as an average minmax separation greater than or equal to zero (ii), yielding 6938 out of 22429 genes.
For the human pancreatic cancer data, condition-median intensity had to be at least 2 × 10^5 in at least one condition (i) accompanied by positive minmax separation in at least one condition (ii). Furthermore, because the data were prone to saturated spots, we excluded all genes with raw intensities exceeding 3 × 10^6 in any measurement (iii), leaving 442 out of 3559 genes.
Further analysis was based on intensities or log ratios of the surviving genes for mono-(yeast) or multi-channel (fly and cancer) data, respectively. Experiment annotation values were represented by adding the condition medians of the annotated conditions.
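The three filter criteria (intensity level, separation from the control, lack of saturation) can be sketched for a single gene as follows. This is a minimal sketch under stated assumptions: minmax separation is taken here to be the gap between the value ranges of control and condition replicates (positive when the ranges do not overlap), which may differ in scaling from the definition in ref. [86]; criterion (i) is checked over the control and all other conditions; and the function and parameter names are hypothetical.

```python
from statistics import median


def minmax_separation(control, condition):
    """Gap between the value ranges of two replicate groups.

    Assumed definition: positive when the ranges do not overlap,
    negative when they do.
    """
    gap_up = min(condition) - max(control)    # condition entirely above control
    gap_down = min(control) - max(condition)  # condition entirely below control
    return max(gap_up, gap_down)


def passes_filter(control, conditions, raw_max, min_level, saturation):
    """Apply criteria (i) to (iii) to one gene.

    control    : replicate intensities of the control condition
    conditions : dict mapping condition name -> replicate intensities
    raw_max    : largest raw intensity of the gene in any measurement
    """
    groups = [control] + list(conditions.values())
    level_ok = any(median(g) >= min_level for g in groups)        # (i)
    separated = any(minmax_separation(control, g) > 0
                    for g in conditions.values())                 # (ii)
    unsaturated = raw_max <= saturation                           # (iii)
    return level_ok and separated and unsaturated


# Thresholds quoted above for the pancreatic cancer data.
control = [1.0e5, 1.2e5, 1.1e5]
tumor = {"tumor": [3.0e5, 3.5e5, 4.0e5]}
keep = passes_filter(control, tumor, raw_max=4.0e5,
                     min_level=2e5, saturation=3e6)
```

A gene whose tumor replicates overlap the control range, or whose raw intensity exceeds the saturation threshold in any measurement, is rejected by the same call.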

Preprocessing of experiment annotations
Much like the transcription data themselves, their annotations need to be preprocessed. Not all annotated traits correlate with transcription. Moreover, even before being judged in this respect, some traits need to be defined in the first place. The values taken by a continuous annotation will most probably differ for all measurements (if only in the third position after the decimal point). In order to characterize groups of measurements, the value range has to be discretized into few values (intervals) that can be discriminated on the basis of the collected data.

Discretization.
All annotations taking more than 4 values were subjected to discretization. The individual decisions for the presented data are detailed below.
For the yeast data, annotations 'array individual', 'total activity', 'date of entry month', 'experimentator hybridization', and 'temperature' were discretized as shown in Fig. 8. All values were kept without grouping for 'array series', 'strains', 'transgenes', 'base media', 'temporary additive', 'temporary additive conc.', and 'glucose'. Annotations 'label incorporation rate', 'exposure time', and 'date of entry day' were inactivated for obviously not showing meaningful value groups. 'incubation period' may better be investigated additive by additive and was therefore inactivated in this context. Likewise, 'temperature shift incubation period' may better be investigated temperature by temperature and was therefore inactivated. 'array hybridization' exhibited clusters, but we suspect that some experimenters did not annotate it according to its correct meaning (otherwise, it would mean that one chip was reused up to 14 times). As a precaution, it was inactivated as well.
For the fly data, annotation 'array individual' was discretized as shown in Fig. 9, 'embryo' was kept unchanged, and 'label incorporation rate' as well as 'amount of cDNA' were inactivated for showing no obvious clustering of consecutive values.
For the cancer data, annotations 'live status', 'tumor type', 'pT stage', 'tumor subregion', 'smoking', 'alcohol consumption', 'weight loss in last 4 weeks', and 'OP procedure' were discretized as shown in Fig. 10. Initially, we kept all values of 'tumor type' for appearing nicely correlated to transcription. However, IPMT samples showed negative silhouette values because they do not separate from the healthy tissues (data not shown). Annotations 'array series', 'tumor size', and 'WHO stage' were taken on unchanged. Annotations 'array individual', 'birth date day', 'birth date month', 'birth date year', and 'CA 19-9' were inactivated for showing no obvious clustering of consecutive values.
Filtering. After discretization, annotation values are projected as centroids of the corresponding experimental conditions. This type of investigation works better the more the projected traits vary in their transcription profiles. Also, each annotation value should correspond to a homogeneous cluster of conditions well-separated from conditions annotated with other values of the same annotation. In this case, it appears perfectly justified to briefly characterize it by a single data point, which can be regarded as a prototype or class-representative for the particular annotation value.
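The discretization decisions above were made individually, looking for clusters of consecutive values. As a rough illustration of that idea only, the sketch below splits the sorted values of a continuous annotation at its widest gaps. The rule and the names discretize_by_gaps and assign_interval are hypothetical and are not the procedure behind Figs. 8 to 10.

```python
def discretize_by_gaps(values, n_groups):
    """Split the observed values into n_groups intervals at the widest gaps.

    Returns a list of (low, high) intervals covering the observed values.
    """
    ordered = sorted(set(values))
    if n_groups < 1 or n_groups > len(ordered):
        raise ValueError("n_groups must lie between 1 and the number of distinct values")
    # Gap after each distinct value, remembered together with its position.
    gaps = [(ordered[k + 1] - ordered[k], k) for k in range(len(ordered) - 1)]
    # Positions of the (n_groups - 1) widest gaps become interval boundaries.
    cuts = sorted(k for _, k in sorted(gaps, reverse=True)[:n_groups - 1])
    intervals, start = [], 0
    for k in cuts:
        intervals.append((ordered[start], ordered[k]))
        start = k + 1
    intervals.append((ordered[start], ordered[-1]))
    return intervals


def assign_interval(value, intervals):
    """Return the index of the interval containing value, or None."""
    for idx, (low, high) in enumerate(intervals):
        if low <= value <= high:
            return idx
    return None


# Made-up temperature readings, for illustration only.
temperatures = [25, 25.5, 26, 30, 30.2, 37, 37.1, 37.5]
intervals = discretize_by_gaps(temperatures, 3)
```

An annotation for which no choice of n_groups yields clearly separated intervals would, in the spirit of the text, be inactivated rather than forced into groups.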
But not all traits should be taken at face value. Some do not carry considerable amounts of information in terms of transcription behaviour. We assess this by computing their inertia contributions. The inertia, computed as the $\chi^2$ statistic divided by the grand total of the data table, is a means of assessing the variance or information content of a data table. Here, each table column contains the (prototype) transcription profile of a particular experiment annotation value, contributing a certain share to the total inertia of the table. The discretized annotation values are ranked and/or filtered according to the variance they contribute, either in the context of the values of only one annotation (Tab. 4) or of all annotations (Tab. 5). For the latter, the variance contributed by all values of a particular annotation can be added in order to rank the annotations by their "relevance" (Tabs. 1, 2, 3, and 6).
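The inertia contributions described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `column_inertia` and the toy table are hypothetical, and the sketch assumes a non-negative table with strictly positive row and column sums and a non-zero total inertia.

```python
def column_inertia(table):
    """Return (total inertia, per-column shares of the total inertia).

    table: list of rows (genes); each row lists one value per column
    (one column per annotation value prototype). Assumes non-negative
    entries and strictly positive row and column sums.
    """
    n_rows, n_cols = len(table), len(table[0])
    grand_total = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]
    col_sums = [sum(table[i][j] for i in range(n_rows)) for j in range(n_cols)]

    # Chi-square term of every cell, accumulated per column.
    contrib = [0.0] * n_cols
    for i in range(n_rows):
        for j in range(n_cols):
            expected = row_sums[i] * col_sums[j] / grand_total
            contrib[j] += (table[i][j] - expected) ** 2 / expected

    # Inertia = chi-square statistic divided by the grand total.
    contrib = [c / grand_total for c in contrib]
    total = sum(contrib)
    shares = [c / total for c in contrib]
    return total, shares


# Toy table: 2 genes x 2 annotation-value prototypes (made-up numbers).
total, shares = column_inertia([[10, 20], [20, 10]])
```

Summing the shares of all values belonging to one annotation gives a per-annotation figure that can be used for ranking, in the spirit of the "relevance" ranking mentioned above.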
To assess whether a trait annotates a distinct cluster of experiments, we compute the Silhouette value (SV, [89]). Let there be an experimental annotation $A$ taking values $i \in A$. One SV per annotation value $i$ and measurement $j$ is computed as $s_{ij} = (b_{ij} - a_{ij}) / \max(a_{ij}, b_{ij})$, where $a_{ij}$ is the average distance of annotated measurement $j$ to all other measurements annotated with $i$, and $b_{ij}$ is the minimum of average distances of measurement $j$ to all measurements not annotated with $i$. Here, the Silhouette scores were computed on the basis of the $\chi^2$ distances.
An SV close to one results for measurements that are well separated from the measurements of neighboring clusters (composed of measurements annotated with annotation values other than i). A score around zero means that the measurement could just as well be assigned to another annotation value. A score close to minus one indicates that the measurement is most likely misclassified, i.e. transcriptionally closer to an annotation value other than the one it is annotated with.
The average SV of all measurements annotated with a particular annotation value i is used to rank and/or filter the annotation values (Tabs. 4 and 5). We further calculate the mean of the SVs over all $i \in A$. This average SV for an annotation A (Tabs. 3 and 6) indicates whether the parameter correlates with transcriptional changes in a reproducible manner. Figure 4 exemplifies the typical behaviour of an uninformative annotation value. The value 'never' shows a negative SV because it is dispersed over the area spanned by the remaining three values of 'alcohol consumption'. It also shows a low variance contribution, mainly because it is quite evenly spread out.
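The Silhouette computation can be sketched as below. This is an illustration under stated assumptions rather than the procedure actually used: the function names are hypothetical, the $\chi^2$ distance is taken in its standard correspondence-analysis form (differences between column profiles, weighted by the inverse row masses), $b_{ij}$ is taken as the smallest average distance to the measurements of any other annotation value, and measurements that are alone in their group receive a score of zero by convention.

```python
import math
from collections import defaultdict


def chi2_distances(table):
    """Pairwise chi-square distances between the columns (measurements).

    table: list of rows (genes), one value per measurement; assumes
    non-negative entries and strictly positive row and column sums.
    """
    n_rows, n_cols = len(table), len(table[0])
    grand = sum(sum(row) for row in table)
    row_mass = [sum(row) / grand for row in table]
    col_sum = [sum(table[i][j] for i in range(n_rows)) for j in range(n_cols)]
    profiles = [[table[i][j] / col_sum[j] for i in range(n_rows)]
                for j in range(n_cols)]
    dist = [[0.0] * n_cols for _ in range(n_cols)]
    for j in range(n_cols):
        for k in range(j + 1, n_cols):
            d2 = sum((profiles[j][i] - profiles[k][i]) ** 2 / row_mass[i]
                     for i in range(n_rows))
            dist[j][k] = dist[k][j] = math.sqrt(d2)
    return dist


def silhouettes(dist, labels):
    """One Silhouette value per measurement: (b - a) / max(a, b)."""
    members = defaultdict(list)
    for j, lab in enumerate(labels):
        members[lab].append(j)
    scores = []
    for j, lab in enumerate(labels):
        own = [k for k in members[lab] if k != j]
        if not own or len(members) < 2:
            scores.append(0.0)  # convention for singletons (assumption)
            continue
        a = sum(dist[j][k] for k in own) / len(own)
        b = min(sum(dist[j][k] for k in ks) / len(ks)
                for other, ks in members.items() if other != lab)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


def mean_sv_per_value(scores, labels):
    """Average Silhouette value of the measurements of each annotation value."""
    grouped = defaultdict(list)
    for s, lab in zip(scores, labels):
        grouped[lab].append(s)
    return {lab: sum(v) / len(v) for lab, v in grouped.items()}


# Toy data: 3 genes x 4 measurements (made-up numbers).
toy = [[10, 11, 1, 2], [1, 2, 10, 11], [5, 5, 5, 5]]
dist = chi2_distances(toy)
good = silhouettes(dist, ["a", "a", "b", "b"])  # labels match the structure
bad = silhouettes(dist, ["a", "b", "a", "b"])   # labels cut across it
```

In the toy data, labels that follow the two visible groups yield positive scores, whereas labels that cut across the groups yield negative ones. Averaging the per-value means returned by `mean_sv_per_value` gives one score per annotation.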
In order not to let the principal axes be attracted either by such ubiquitously dispersed features or those showing an "average" transcription profile, these should be thoroughly filtered. Figure 4 shows informative and uninformative values combined in a single annotation, showing that it is advisable to filter out single annotation values rather than whole annotations. When visualizing more than one parameter (Figs. 2 and 3), we disregarded all annotation values showing negative SV or inertia contributions below one percent.
Whereas the former criterion ensures tight clustering of the measurements annotated by a particular annotation value, the latter, in addition to picking marked transcription profiles, also selects for a substantial number of observations per annotation value. Among the annotation values listed in Tab. 5 that show positive SV, all those comprising fewer than four measurements have been excluded from visualization because of their low inertia contributions. This is because adding fewer measurements causes a lower column weight and therefore results in a smaller inertia contribution. Whereas the inertia criterion favours annotations having many values, the SV does the opposite. The two criteria complement each other in identifying traits suited to characterize the transcription data under study.
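The combined filter (discard annotation values with negative SV or with an inertia contribution below one percent) can be written down in a few lines. The sketch below is illustrative only: the function name, the input layout, and all numbers are hypothetical and do not reproduce values from the tables.

```python
def filter_annotation_values(stats, min_inertia_share=0.01):
    """Keep annotation values with non-negative mean SV and sufficient inertia.

    stats: dict mapping an annotation value to a tuple
    (mean Silhouette value, share of total inertia as a fraction).
    """
    return [name for name, (sv, share) in stats.items()
            if sv >= 0 and share >= min_inertia_share]


# Made-up numbers for illustration only.
example = {
    "value_1": (0.40, 0.050),   # kept: positive SV, share of 5 percent
    "value_2": (-0.10, 0.080),  # dropped: negative SV
    "value_3": (0.30, 0.004),   # dropped: share below one percent
}
kept = filter_annotation_values(example)
```

Filtering at the level of single annotation values rather than whole annotations matches the recommendation drawn from Figure 4.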

Significance analysis of differences between two traits
Significance analysis of microarrays (SAM) was performed to assess the difference between past and present alcohol consumption. It was based on log2-transformed ratios, with the data table containing all the genes on the array. Initially, the table contained all measurements of all patients annotated with either present or past alcohol consumption. In order to exclude the technical variance, the gene-wise median of all measurements for one patient was computed, and one column per patient was subjected to SAM as two-class (past and present), unpaired data. A second analysis was performed including the technical variance. Here, we only averaged beforehand over the two measurements stemming from duplicate spots on the same array. SAM version 1.21 (Nov 2002, Excel plug-in obtained from [90]) was used.
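The collapsing step that removes technical variance before SAM (one gene-wise median column per patient) can be sketched as follows. The function name, the identifiers, and the numbers are hypothetical; the sketch assumes that every measurement lists the log2 ratios of the same genes in the same order, and it does not perform the SAM analysis itself.

```python
from collections import defaultdict
from statistics import median


def collapse_by_patient(columns, patient_of):
    """Gene-wise median over all measurements belonging to one patient.

    columns: dict mapping a measurement id to its list of log2 ratios
             (one entry per gene, same gene order in every list).
    patient_of: dict mapping a measurement id to a patient id.
    Returns a dict mapping each patient id to one median profile.
    """
    grouped = defaultdict(list)
    for measurement, values in columns.items():
        grouped[patient_of[measurement]].append(values)
    return {patient: [median(gene_values) for gene_values in zip(*profiles)]
            for patient, profiles in grouped.items()}


# Made-up log2 ratios: two genes, three measurements for P1, one for P2.
columns = {"m1": [1.0, 2.0], "m2": [3.0, 0.0], "m3": [5.0, 4.0], "m4": [0.5, 0.5]}
patient_of = {"m1": "P1", "m2": "P1", "m3": "P1", "m4": "P2"}
per_patient = collapse_by_patient(columns, patient_of)
```

The resulting per-patient columns, split by the two classes (past and present), would then form the unpaired two-class input.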