Data Sets
All microarray data sets were downloaded from the NCBI GEO public archive, generated in Affymetrix Human Genome U133 Plus 2.0 platform with 54675 probe sets on chips. The raw data (CEL file) were normalized by RMA algorithm using the 'affy' package of the Bioconductor http://www.bioconductor.org[17] software suite in the R Project for statistical Computing http://www.r-project.org. See Materials S1 in Additional file 1 for details.
Embryonic Stem Cell Neurogenesis data set (ESCN)
This data set contains 27 samples over five periods of human embryonic development: three embryonic stem cell (ESC) samples, three embryoid body (EB) samples, six primitive ectoderm cell (PEL) samples, six neural tube-like rosette cell samples, and nine post-natal neural stem cell (NSC) samples.
Embryonic Stem Cell Vasculogenesis data set (ESCV)
In this data set, there are 13 samples over four periods of human embryonic stem cell differentiation into human mature (vascular) endothelial cells: three undifferentiated embryonic stem cell (ESC) samples, three mesodermal progenitor cell (MPC) samples, four embryoid body (EB) samples, and three human mature vascular endothelial cell (VEC) samples. The results and discussion of application of MFSelector to this data set are exhibited in Materials S1 (in Additional file 1).
Synthetic data sets
We generated two sets of synthetic data. These data sets contain 50 samples each spread over five equal sized stages. There is a set with descending trends (denoted 'Des') and a set with ascending trends (denoted 'Asc'). These synthetic data sets are named 's50_Asc' and 's50_Des', respectively. In addition, each data set has 120 genes which are classified into 9 types of monotonic genes (or monotonic-like genes): 'Good (distinct)', 'Good (close)', 'Slightly', 'Outliers (slight)', 'Outliers (severe)', 'Moderately', 'Severely', 'Partially ordered (far)', and 'Partially ordered (close)'. There are 20 genes for each of 'Slightly', 'Moderately', and 'Severely' types and 10 genes for each of 'Good (distinct)', 'Good (close)', 'Outliers (slight)', 'Outliers (severe)', 'Partially ordered (far)', and 'Partially ordered (close)'. If most of the samples follow a trend over time/stages but only a few do not, we call those few samples as outliers. In this study the number of outliers for the synthetic data sets is restricted to less than 6% of total samples.
Here we use genes with a monotonic ascending trend to illustrate these nine types of genes used in the synthetic data as shown in Figure 1. In this study, genes with a monotonic ascending trend are those where the average expression values of samples in Stage One are lower than those of samples in Stage Two, the average expression values of samples in Stage Two are lower than those of samples in Stage Three, and so on. In the synthetic data sets, 'Good' type genes have the strongest degree of monotonicity followed in order with 'Slightly', 'Moderately', and 'Severely' monotonic type genes. For 'Good' type genes, there is no overlapping of samples between stages, and for 'Severely' type genes, the samples in one stage have samples that severely overlap the samples in the adjacent/other stages. In addition, 'Partially ordered' type genes indicate that the samples in four of the five stages form a monotonic ascending trend, while the samples in the remaining stage overlap the samples in other stages. 'Outliers' type genes denote that samples in the five stages form a monotonic ascending trend, but some of the samples in one of the five stages are slightly/significantly away from the expression values of other samples in the same stage.
The 'Good' category has two subcategories, 'Good (distinct)' and 'Good (close)'. 'Good (distinct)' genes indicate they have the strongest monotonic ascending trend and samples have distinct separation between stages; while 'Good (close)' genes indicate they still have the strongest monotonic trend but samples between stages are close. 'Partially ordered' type genes are also further separated into 'Partially ordered (far)', and 'Partially ordered (close)'. The 'Partially ordered (far)' means that the samples in one of the stages are significantly far from the monotonic trend in a gene expression profile while 'Partially ordered (close)' means that the samples in one of the stages are close to the monotonic trend but it is still out of monotonic order in the gene expression profile. The genes of type 'Outliers (severe)' represent genes where at most six percent of the samples are relatively away from the monotonic trend, and type 'Outliers (slight)' indicates that at most six percent of the samples are slightly away from the monotonic trend. The results on these synthetic data sets by our proposed method (MFSelector) and other two methods are presented in the Results section.
Monotonic Feature Selector (MFSelector)
In this study, we propose a novel tool, Monotonic Feature Selector (MFSelector) which includes a novel index, called DEtotal (total discriminating error), used to identify genes/biomarkers with stronger monotonic features which may bear a correlation to stem cell development. DEtotal does not make any distributional assumption about the data. MFSelector also provides the related statistical information (p- and q- value) for each monotonic gene. Moreover, in order to distinguish among monotonic genes with the same DEtotal, MFSelector also offers an additional novel index, called SVDE (sample variance for discriminating error). Users can download MFSelector from the web site: http://microarray.ym.edu.tw/tools/MFSelector/. Figure 2 depicts the processing steps involved in the computation of DEtotal, p- and q- values, and SVDE.
Computation of discriminating error for each level of a gene
Here we use several steps to examine genes with monotonically ascending or descending profiles, which are distributed over N stages. For example, when a monotonic gene has an ascending profile, first we assume samples in Stage One have the lower expression values and samples in other stages have higher expression values. We use a sample in Stage One to draw a horizontal discriminating line to determine how many samples in Stage One have higher expression values than the discriminating line and how many samples in other stages have lower expression values than the discriminating line. The number of samples on the wrong side of the discriminating line is referred to as "discriminating errors". Distinguishing Stage One from all other stages is called the "Level One" process. In addition, since every sample in Stage One can be used to draw a discriminating line, we select the discriminating line with the smallest number of discriminating errors for this level. Note that, when more than one discriminating line for Stage One have the same number of discriminating errors, we can choose any one of the lines; here we use the discriminating line corresponding to the lowest sample number. Secondly, again for the same gene, as a "Level Two" process, we assume samples in Stages One and Two have lower expression patterns than samples in other stages. Therefore, we use each sample from Stages One and Two taken together to draw a discriminating line and determine discriminating error as in Level One. When more than one sample from Stages One and Two together result in the same smallest number of discriminating errors, we consider: (a) when the samples are from the same stage, then we select the line (hence the sample) corresponding to the lowest sample number; (b) when the samples are from different stages, we give priority to the lowest sample number from the highest stage, in this case, Stage Two. This entire process is repeated N-1 times, each time adding one stage (e.g., for the "Level Three" process the union of Stages One, Two and Three versus all other stages; for the "Level Four" process the union of Stages One, Two, Three and Four against all other stages, and so on until the "Level N-1"). The discriminating line is selected in the same manner for each of the N-1 levels (in total, N-1 discriminating lines are considered for a gene). It is interesting to note that, if the number of distinct discriminating lines is equal to N-1, the corresponding gene is more likely to be a good monotonic gene.
Our outlier strategy is different from that used by rank-based statistical tests (e.g., Cuzick-test), where an outlier's deviation in expression value influences its rank, often adversely affecting the statistical calculation. MFSelector does not count samples on the wrong side of the discriminating line more than once. Suppose an outlier adds a discriminating error in the Level M process. This outlier is very likely to add a discriminating error in later levels, i.e., "Level M+1" to "Level N-1"). But we count the outlier's discriminating error only once at Level M. We compare the results for the Cuzick-test and ours in the Results section.
For a monotonic gene with a descending profile, samples in Stage One should have the higher expression values than the samples in other stages. Thus, we can get the "discriminating errors" for "Level One" process by drawing a discriminating line through one of the samples in Stage One to determine the number of samples in Stage One that have lower expression values than the discriminating line and the number of samples in other stages having higher expression values than the discriminating line. As with an ascending profile, this process is repeated N-1 times, though in this case, in descending direction. The discriminating line for each level is also selected in the same manner as that for the ascending case.
Computation of total Discriminating Error (DEtotal) for a gene
In order to evaluate monotonicity of a gene (e.g., genes with weak or strong monotonic features), we use the sum of its discriminating errors over all N-1 level processes (total discriminating error) denoted by "DEtotal". For instance, when DEtotal=k, k discriminating errors exist over N-1 level processes. Here, the smaller DEtotal of a gene, the fewer the number of discriminating errors and the stronger the monotonicity of the gene. DEtotal is an evaluation index to determine monotonic genes. We sort the DEtotal values for all genes in increasing order to select genes with stronger monotonic features for multi-stage and time-series array data. Note that, in this study we separately consider the DEtotal for monotonically ascending or descending cases. In addition, we also develop another strategy using sample variance for discriminating error (SVDE) to further determine the gene with stronger monotonic feature when several genes have the same DEtotal value. This is explained later.
Calculating DEtotalfor a gene: an illustration
We now demonstrate our algorithm using a synthetic gene profile with five stages as shown in Figure 3. Suppose in this case, this gene has an ascending nature. We perform the following steps for this gene as mentioned above.
First, for the "Level One" process, every sample in Stage One is used to draw a discriminating line to determine its associated number of discriminating errors (the number of samples in Stage One that have higher expression values than the discriminating line and the number of samples in other stages having lower expression values than the discriminating line). In this case, we obtain 7, 2, and 2 discriminating errors taking each of three samples in order from Stage One to draw the discriminating lines. For example, one sample from Stage One and another sample from Stage Four are on the wrong side of the discriminating line when considering the 2nd sample in Stage One to draw the discriminating line, resulting in two discriminating errors. In addition, since two discriminating lines (i.e., corresponding to the 2nd and 3rd samples) from Stage One have the same number of discriminating errors, we choose the sample with the lowest sample number (i.e., 2nd sample) and get two discriminating errors in this level.
Second, for the "Level Two" process, every sample in Stages One and Two is used to draw a discriminating line to determine its associated number of discriminating errors. In this level, we obtain 3, 3, 4, 0, 2, and 1 discriminating errors taking each of six samples in order from Stages One and Two to draw the discriminating lines. Note that, to compute the discriminating errors for the "Level Two" process, we do not consider the samples which have already caused discriminating errors in "Level One". Such samples are masked pink in Figure 3(B), one sample from Stage One and another sample from Stage Four. The extent of outliers may strongly influence results in other methods, but in our method we consider "outliers" only once to contribute to the discriminating errors and hence it allows us to determine monotonicity without the results being skewed by the position of "outliers". This is why our method has some advantage over other approaches. Finally, we choose the sample with the lowest number of discriminating errors (i.e., the 4th sample from Stage One and Two) and get no discriminating errors in this level.
Third, for the "Level Three" process, every sample in Stages One, Two, and Three is used to draw a discriminating line to determine its associated number of discriminating errors. We get 3, 9, 10, 6, 8, 7, 2, 4, 5, 1, 3, and 1 discriminating errors taking each of 12 samples in order from Stage One to Stage Three. The 10th sample is selected as the discriminating line because its discriminating error is the lowest, and it has the lowest sample number of the two Level Three samples with the same discriminating error of one.
Similarly, the process is continued for the "Level Four". In this case we determine the 16th sample with no discriminating error. Finally, we add up all discriminating errors for each selected discriminating line from every level and get the DEtotal = 2 + 0 + 1 + 0 = 3.
Statistical analysis of monotonic genes
Computation of p-value and q-value
To assess the statistical significance of the DEtotal index associated with the identified genes in ascending and descending profiles, a permutation test has been performed in Figure 2. Both unadjusted p-values and adjusted q-values for multiple comparisons are computed as in our previous studies [7, 8]. Let G be the total number of genes and S be the total number of sample points. The procedure followed is summarized below.
Step 1. Given a gene expression matrix X (x
gs
is the expression intensity of gene g in sample unit s; 1 ≤ g ≤ G, 1 ≤ s ≤ S) with class labels (y
s
, 1 ≤ s ≤ S), we compute the DEtotal,g for ascending case and denotes it by a
g
for notational simplicity. Similarly, for the descending case also we compute DEtotal,g and denote it by d
g
.
Step 2. Randomly permute the class labels y
s
B times (500 times in this study). In the bth permutation (1 ≤ b ≤ B), compute and for gene g using the gene expression matrix X and the permuted labels .
Step 3. The p-value of the observed DEtotal,g with ascending characteristic, a
g
, for gene g is
(1)
where I(·) is an indicator function that takes the value one when true and takes the value zero when false. Similarly, the p-value of the observed DEtotal,g with descending nature, d
g
, is
(2)
Step 4. To account for the multiple tests being performed for the G genes, q-values of the observed a
g
and d
g
are calculated as
(3)
and
(4)
Computation of sample variance for discriminating error (SVDE)
Statistical tests are unable to distinguish which genes with the same level of statistical significance are better. Two genes with the same DEtotal may exhibit different monotonicities. Therefore, we develop SVDE to address the degree of monotonicity of a gene. This can also help to differentiate between genes with the same DEtotal value.
For example, in the ESCN data set, LOC100506013 and FAM60A are monotonically descending genes and have the same DEtotal = 0 (shown in Figure 4). However, the expressions of the samples of FAM60A from Stages One to Four are tightly grouped (albeit not overlapped) and of LOC100506013 from Stages One to Four are highly distinguished, and only the expression values of the samples from Stages Four and Five are somewhat close. For these two genes with the same DEtotal, LOC100506013 should have a higher degree of monotonicity (Figure 4(A)) than FAM60A (Figure 4(B)).
In order to assess the degree of monotonicity (particularly when more than one gene have the same DEtotal), all samples for each of the genes are slightly altered in expression values to examine whether the DEtotal of the altered expression values has changed significantly or not. To evaluate this, we propose an index, called Sample Variance for Discriminating Error (SVDE). We perturb a dataset and evaluate the extent of confidence on the monotonicity properly by adding noise to all samples (as shown in Additional file 2: Fig. S1), and apply the same method to the perturbed a dataset to calculate the DEtotal of genes.
To perturb a dataset with m samples and n genes, we first compute the standard deviation σ
i
of each gene x
i
and divide it by 10 to compute σ'
i
. Next, we generate m noise values from a normal distribution with mean equal to 0 and standard deviation equal to σ'
i
for each gene. Finally, we add such a random noise to every observed value of gene x
i
. The noise corrupted gene is used to compute its discriminating errors for each level. Let the resulting total discriminating error for this noisy gene be DEtotal. We repeat this procedure in the same manner M times (M = 100 in this study), and get M new DEtotal values for each gene. We denote the new DEtotal value as the DE
i
, i = 1, 2, ..., M. Next we compute the variance of these DE
i
values with respect to original DEtotal for this gene. We denote the original DEtotal as DEorg. Hence for these new DEtotal values we can calculate the variance as
(5)
The SVDE is a measure of how far the set of the new DEtotal values are spread around the original DEtotal (DEorg). If each DE
i
is equal to or close to DE
org
, the SVDE will be very small. Figure 4(A) shows how hard it is to change the DEtotal of a highly confident monotonic gene even after randomly altering the expression values of all samples. The smaller the SVDE, the stronger the monotonicity of the gene. Figure 4(B) depicts the monotonic profile of another gene where DEtotal is equal to zero, but the SVDE is high. The reason is that separation between adjacent stages for Stages One to Four is not sufficiently distinct. Consequently, genes with a smaller SVDE can be considered more strongly monotonic than genes with a larger SVDE.
One may think that the time interval between observations should be given importance in defining any index for finding such stemness markers. This is not so because here our objective is to discover monotonic stemness marker genes which are associated with stages of stem cell differentiation. These stages are clinical stages which depend on pathology/clinical phenotypes and are not defined taking observations with arbitrary/equal time interval, but the biological process here dictates the time interval between observations. However, our method would be equally applicable to other experimental scenario where the researchers design experiments to take observations with equal time interval for some scientific reasons. It does not matter if the required time interval is large or small.
Cuzick-test
In order to further validate the ability of our method to identify monotonic genes, we have made a comparison with the Cuzick-test [15], an extension of a Wilcoxon-type statistical test for discovering specific patterns with pre-designed trends, such as oscillation, step-wise patterns, and in this case monotonicity. Here we use the Cuzick-test to compute the Cuzick value (z-score) for the gene expression data for each gene as follows. Let n be the total number of samples from all classes, r
i
be the rank of the ith sample determined by the order of its expression value among all samples, N be the number of stages, and p
i
be the proportion of samples from class i. Each stage has a weight w
k
, k = 1,...,N. We associate a weight z
i
to a sample i, z
i
= w
j
, if the ith sample is from the jth stage. The Cuzick statistic is computed as
(6)
where
(7)
(8)
(9)
To find ascending genes, we use w
i
= i, i = 1, ..., N; while for descending genes, we use w
i
= N-i+1, i = 1, ..., N. The Cuzick statistic above is obtained as a z-score, hence, we can determine a p-value for each gene and identify whether a gene is monotonic and the extent of its monotonicity (a stronger monotonic gene with a smaller p-value) by referring to the z-score table. In this study, we compare the performance of the Cuzick-test approach and our MFSelector approach (shown in the Results section).
The modified M statistic test
We have also performed a comparison with the modified M statistic test [16]. The modified M statistic test (a distribution-based statistical test) is used to analyze dose-response studies in microarray experiments and investigate a trend in the response level expression with respect to doses in multi-class array data. In order to effectively find monotonic genes, the modified M statistic test adopts the means ordering strategy that considers the order restriction of mean expressions (responses) regarding the increasing or decreasing doses through K stages, respectively. The modified M statistic test and several other statistical tests are further integrated into an R package called IsoGene [18]. However, the modified M statistic test is less effective because the existence of outliers in a stage strongly influences the order of mean responses (as shown in the Results section).
Functional annotation
After getting the list of monotonic genes, we can sort them by DEtotal or p-value in ascending order to identify those genes with stronger monotonic patterns, and decide the threshold for the gene list based on monotonicity. In addition, we further sort these monotonic genes with the same DEtotal values according to their associated SVDE. The resulting gene list is fed into the web service tool DAVID Bioinformatics Resources http://david.abcc.ncifcrf.gov/ to identify correspondingly enriched biological processes by cross-referencing to the gene ontology database http://www.geneontology.org/.