Applying forces to rat ears and obtaining the ear RNA
Male wistar rats weighing approximately 180 g were chosen for this study. All animal studies were conducted according to institutional guidelines. Strips of latex were fastened to the ears and, using appropriately calibrated rubber bands, forces measuring 50 g were applied to the ears at three points (Figure 2). Eight time points were sampled at 1 h, 4 h, 9 h, 13 h, 19 h, 48 h, 72 h, 96 h. Three rats were used per time point. One ear of each rat (chosen randomly) was stretched while the other served as control. At the time points listed above, the ears were excised and immediately frozen in liquid nitrogen and then stored at -80°C. The three stretched ears at each time point and the three unstretched ears at the same point were separately pooled (70 to 80% of the whole ear was used) and the RNA extractions carried out. The RNA was then hybridized to RAE 230 2.0 Affymetrix rat gene chips. MAS5 was used to obtain the expression values from the images. Expression datasets will be deposited at geodatasets.
Analysis scheme
To gain insights into the molecular mechanisms and pathways underlying the cellular response to forces and tissue stretch, a gene expression analysis was conducted by stretching rat ears (see Figure 2) over a period of five days and then obtaining their expression profiles using Affymetrix arrays. Traditional techniques that analyze gene expression datasets rely on clustering techniques that do not take advantage of pre-existing information about pathways thus not taking account of biological context. This is also true of other analyses that rank genes by fold change or other measures of significance. Enrichment techniques which include The Absolute Enrichment (AE), The Gene Set Enrichment Analysis (GSEA), and what we call The Down Regulated Analysis (DRE) are powerful analysis techniques that not only take advantage of pre-existing pathway knowledge but also in their implementation increase the signal to noise ratio thereby giving us a higher chance of capturing subtle signals in the expression set. Enrichment techniques start by using a ranking metric to reorder the expression dataset.
Next, a priori created groups of genes are scored on this reordered dataset and ranked from most highly scored at the top of the reordered dataset to the least highly scored or 'enriched' groups or gene sets. Thus, rather than looking at significance of each gene within the expression dataset, we look at significance of groups of genes. Since the gene set enrichment analysis is not limited by the type of statistic used, we should use a statistic that will be best at elucidating the changes between the groups that we are studying. We thus conducted our analysis by reordering our dataset using the paired t-statistic as our ranking metric (as our dataset consists of paired data points and thus this statistic will be best for capturing differences). Enrichment techniques can be powerful tools in analyzing data. However, within the enrichment analysis framework, people have focused solely on upregulation. They have perhaps neglected to look at down regulation. Downregulation itself can be just as important as upregulation [5]. However, both downregulated and upregulated (enrichment) analyses may miss differential effects that show some balance between up and downregulation (what essentially we have termed "homeostatic" systems) [5].
Let's look at a (metabolic) pathway (with feedback) to make this more clear. Let's say a pathway has 3 members A through C, and let's say A leads to upregulation of B which leads to upregulation of C (Figure 7). Further, if B can be upregulated independently of A, then B will downregulate A and if C can be upregulated independently of B, then its upregulation will downregulate both A and B (Figure 8). And further, if C itself feeds into another pathway which itself (the pathway) may be upregulated independently of C, then that pathway when independently upregulated will downregulate A, B, and C (Figure 9). We can essentially perturb a pathway at any point from start to finish (and not just at the start or at the end). If we perturb it in between such that there is a good mixture of both upregulation (coming from the downstream parts of the pathway) and downregulation (coming from the upstream parts of the pathway through feedback mechanisms), then this pathway may be captured by the absolute enrichment analysis (see Figure 10). For more details on absolute enrichment see below as well as ref [5]. (It must be noted that pathways don't have to have this serial relationship as in our conceptualized pathway. There can be bifurcations and so on.)
In essence, we are really looking for a change between an "affected" condition and a "nonaffected" or control condition. A change can be either an upward change (upregulation) or a downward change (downregulation). Thus, we're really looking for the highest differential regulation [5]. Sometimes, the absolute enrichment technique captures gene sets that are seen (at the very top) in either the up or down regulated analyses. Many times, however, it captures other gene sets. Because we run an exhaustive search through enrichment techniques that encompasses both up (or down) regulated and absolute enrichment techniques, we test whether the absolute enrichment technique captures anything extra at the top (of the list of ranked gene sets) compared with either of the up or downregulated techniques.
When what the absolute enrichment captures is found at the top of the up or downregulated analyses, then this analysis doesn't add new information (although a few of the rest of the top ranked gene sets in all the enrichment analyses can still be useful to study). However, when it does capture something new, we focus on it (or at least look at its result) because what it has captured is the most differentially expressed [5]. In our analysis, we captured a new gene set that was higher ranked than even the highest ranked upregulated and downregulated gene sets. Thus, we studied this gene set in more detail. This included running a co-expression network analysis (the fact that we could even run this analysis was driven by the small size of the top ranked hypoxia gene set).
Thus, to recap, our analysis scheme consisted of choosing the appropriate ranking metric (or statistic), running the three enrichment techniques, and then focusing on the gene set obtained from the absolute enrichment analysis (since it captured the most differentially regulated gene set). Our results show that the response to hypoxia pathway was the most differentially regulated pathway and that this pathway showed a homeostatic (comprising components of upregulation and downregulation) response. We believe that this pathway is being most differentially expressed through a homeostatic perturbation.
The rationale for the AE is that in a time series analysis it may be important to see how the system responds to the (mechanical stretch) perturbations that we have imposed on the system by regulating itself. Pathways are essentially composed of elements that respond in tandem to perturbations. Often these responses are self controlled through feedback mechanisms and thus homeostatic response analyses can often give us powerful indicators of pathways that are important in gene expression datasets.
Enrichment analyses
Our data analysis uses enrichment techniques to understand important pathways in our system. Enrichment techniques start out by reordering the dataset using an appropriate statistic (please see Figure 11 for an overview of these techniques using absolute enrichment as an example). Consider the dataset to be a rectangular two dimensional array with the different conditions (control or stretch) represented by columns and the different genes represented along the rows. A suitable statistic is chosen that best helps us differentiate between stretch and control for each gene. Mootha's implementation of the gene set enrichment analysis (GSEA) [25] used the signal to noise ratio statistic to differentiate between the affected condition in his dataset and the control condition. Our dataset has paired samples (in the sense that one ear of the same rat was subjected to stretch while the other ear from the same rat at the same time point served as control). Thus, a better statistic to use in our case is the paired t-statistic since it takes better account of pair-wise differences. Enrichment techniques are not limited to the type of statistic used as the ranking metric. Any statistic that best lets us differentiate between the two classes (control versus stretch) can be used [5] (please see the Appendix for a description of the paired t-statistic).
Once the ranking metric is calculated, the dataset is reordered (in our case, we used the paired t statistic). This is where the three enrichment techniques that we used create different orderings. The GSEA gives us an ordering from most upregulated to the least upregulated. The downregulated enrichment or DRE ranks from most downregulated to the least downregulated. The absolute enrichment [5] uses the absolute value of the ranking metric (the paired t-statistic in our analysis) and then reorders the dataset. Whereas both the standard GSEA [25] and the DRE look for unidirectional regulations, the AE looks at a combined regulation and therefore is more effective at picking homeostatic systems. The absolute enrichment (AE) technique [5] looks at homeostatic perturbations rather than sole up or down regulations. Because it takes absolute values of the genes and then enriches gene sets on this absolute value ranked dataset, the AE becomes less sensitive to less perturbed genes. Once the dataset is reordered, we then test various a priori created gene sets to see how important each is vis-à-vis the reordering (construction of gene sets is explained in the next sub section). In contrast to strategies that look purely for most important genes, we are now looking for significant groups of genes or gene sets. When we look at a single gene, we look for how far up on a reordered dataset (reordered say by fold change) the gene finds itself. In an enrichment technique, however, we look for how far up the gene set finds itself in the reordered dataset. The measure of how far 'up' a gene set is placed on the reordered dataset is given by the enrichment score (ES). The enrichment score is calculated in the following way. Calculate the Kolmogorov Smirnov statistic for each probeset ID in our dataset given by,
if the probeset ID is not part of the gene set and by,
if the probeset ID is part of the gene set, where G is the number of genes in the gene set and N is the total number of genes (or rather probeset IDs in the whole dataset – we had 31099 probeset IDs).
We next compute a running sum of this statistic on the reordered dataset. Where the running sum reaches a maximum score is our enrichment score for the gene set in question [25, 5]. (Figure 11 shows the overall steps in running enrichment techniques). The gene sets are then ranked from highest enrichment score to the lowest. To test whether the results of the analysis are significant, we perform permutation testing of the dataset. To obtain the permutes, we swapped random stretch samples with random control samples. One thousand permuted datasets were constructed. All three enrichment analyses were run on each of these newly constructed datasets, giving us one thousand rankings of gene sets based on their enrichment scores (for each enrichment analysis). We then counted the number of times each gene set achieved top rank in the one thousand permutes. This number divided by 1000 gave us the permutation test P value. If this number was lower than 0.05, then the gene set was significant. For more details on the permutation test, please see Mootha et al.[25]. We didn't run technical replicates at each time point because our analysis is looking at an aggregate change between the stretched ear and control ear conditions across all time points. By doing this we look for overall changes between the eight time points versus their corresponding paired controls. We next describe how the various gene sets that were used in the enrichment techniques were constructed.
Construction of gene sets
Many gene sets were compiled by the use of rat orthologs of previously published genesets [25] (according to [26] "it is estimated that 90% of rat genes have orthologs in the mouse and human genomes that have persisted since they shared a common ancestor"). Gene sets are essentially collections of genes. To derive additional gene sets that may show relevance, we searched the literature for different analyses that have been performed on cell excitation such as response to serum. When these analyses listed groups of genes obtained through various clustering or classifying techniques, we created gene sets for the various clusters listed in the papers. Other gene sets were created by looking up all possible relevant keywords at the geneontology.org website (angiogenesis, response to mechanical stimulus and so on). The gene sets obtained through the various mechanisms outlined above when they comprised genes not in Affymetrix Probeset ID format were then converted to collections of Probeset IDs through various platforms (The Affymetrix website, matchminer, Onto tools from the Intelligent systems Bioinformatics Laboratory). Some of the gene sets that were obtained were derived from clusters obtained through classifying gene expression datasets. Sometimes, the clusters were annotated by the publications as being enriched for certain pathways or GO categories. When they were not (as for example the 'c' clusters in Mootha's published gene set list [25]), we used EASE (a tool for obtaining enrichment information on groups of genes) as we explain next. A total of 173 gene sets were constructed.
EASE
For calculation of GO category enrichment, we used EASE. EASE is a program that determines the over-represented GO categories in a group of genes (specified either by probeset ID, accession number or other identifier) by calculating the hypergeometric ratio of the GO annotations of the genes found in an analysis versus the background distribution of each of the GO annotations. It should be noted that EASE is distinct from the various enrichment tests that are used in this paper. Enrichment techniques (such as AE, GSEA up or down regulated enrichment) measure differential expression of a set or group of genes. These groups of genes are pre-defined by us independently of the data set. EASE is independent of the enrichment techniques that we use. We can run EASE on any (possibly random) group of genes. EASE will then find what annotations are over-represented in these groups. Enrichment techniques try to see which pre-defined gene sets are most important in a re-ordered data set (thus the gene sets are previously defined and then compared to the data set), while EASE takes any group of genes and tries to find which annotations are over-represented in that group. Through the use of the various enrichment techniques, we found the hypoxia gene set to be the most significantly differentially expressed. We next ran a co-expression network analysis on this gene set.
Co-expression network analysis
To understand connection hubs in the hypoxia pathway, a co-expression network was constructed similar to the methodology presented by Zhang and Horvath [27]. The stretch to control expression value ratios were obtained at each time point. Correlations were then obtained for all possible pairs of genes across all the time point ratios. Next we took a threshold of 0.5 of the absolute values of the correlations leaving us with correlations between genes that were higher than 0.5. The genes were then 'connected' to other genes when their pairwise correlations were above this threshold.