- Research article
- Open Access
Combining gene expression, demographic and clinical data in modeling disease: a case study of bipolar disorder and schizophrenia
BMC Genomicsvolume 9, Article number: 531 (2008)
This paper presents a retrospective statistical study on the newly-released data set by the Stanley Neuropathology Consortium on gene expression in bipolar disorder and schizophrenia. This data set contains gene expression data as well as limited demographic and clinical data for each subject. Previous studies using statistical classification or machine learning algorithms have focused on gene expression data only. The present paper investigates if such techniques can benefit from including demographic and clinical data.
We compare six classification algorithms: support vector machines (SVMs), nearest shrunken centroids, decision trees, ensemble of voters, naïve Bayes, and nearest neighbor. SVMs outperform the other algorithms. Using expression data only, they yield an area under the ROC curve of 0.92 for bipolar disorder versus control, and 0.91 for schizophrenia versus control. By including demographic and clinical data, classification performance improves to 0.97 and 0.94 respectively.
This paper demonstrates that SVMs can distinguish bipolar disorder and schizophrenia from normal control at a very high rate. Moreover, it shows that classification performance improves by including demographic and clinical data. We also found that some variables in this data set, such as alcohol and drug use, are strongly associated to the diseases. These variables may affect gene expression and make it more difficult to identify genes that are directly associated to the diseases. Stratification can correct for such variables, but we show that this reduces the power of the statistical methods.
The Stanley Neuropathology Consortium  recently made a large (over 300 sample) data set publicly available on gene expression in the brains of deceased individuals with bipolar disorder or schizophrenia, as well as controls. In addition the data contains limited demographic and clinical history information, including gender and history of smoking, alcohol and drug use. This paper presents a retrospective statistical study on this data set, in which we address the following three questions:
Q1. Can either bipolar disorder or schizophrenia be distinguished from control purely on the basis of gene expression profile?
Q2. Does addition of the demographic and clinical history data further improve the ability to distinguish bipolar disorder or schizophrenia from control?
Q3. Is there a significant difference between the abilities of different widely-used data analysis algorithms to make these distinctions?
We show that bipolar disorder and schizophrenia each can be distinguished from control, based on gene expression alone, significantly better than chance – in fact with areas under the Receiver Operating Characteristic (ROC) curve (AUC) of 0.91 (schizophrenia vs. control) and 0.92 (bipolar disorder vs. control). While area under the ROC curve indicates how well one can distinguish across a range of specificities (with 0.5 being no better than chance and 1.0 being perfect distinction), it is also worth noting that for each task, a sensitivity of 0.85 can be achieved when operating at a specificity of 0.9. Moreover, by taking demographic information and clinical history into account (see Table 1), performance improves to an AUC of 0.94 for schizophrenia vs. control and to an AUC of 0.97 for bipolar disorder vs. control. To our knowledge, this is the first statistical comparison of the efficacy of using a combination of gene expression data and clinical history data against using gene expression data alone. With regard to question Q3, the paper shows that support vector machines (SVMs) significantly outperform the other most widely used algorithms for statistical classification and machine learning for these tasks.
Furthermore, we found that some variables in this data set, such as alcohol and drug use, are strongly associated to the diseases. Given that these variables may affect gene expression, they may make it more difficult to identify genes that are directly associated to the diseases. (We discuss this point in detail later in the text.) We have investigated if post-stratification can correct for such variables, but we found that it significantly reduces the predictive accuracy of the statistical methods.
The expression data set was obtained from the Stanley Neuropathology Consortium . The records utilized in this study are a subset of the entire collection of data. The data set contains 115 schizophrenia patients, 105 patients with bipolar disorder and 112 controls. For each subject, it includes annotated gene expression data and demographic and clinical information. All data was analyzed un-blinded. Diagnosis and criteria have been described previously by Torrey and colleagues . As described in the same report by Torrey and colleagues, the recreational or prescription status of drugs for each of the donors was largely unknown, because the researchers relied on post-mortem urine toxicology screens that are not always conducted.
The expression data was obtained using Affymetrix Human Genome U133A GeneChip oligonucleotide arrays containing 22,283 probe sets (Affymetrix, Santa Clara, CA). Probe level data was summarized using the GC content adjusted robust multi-array average (RMA) method . The data set includes the GC-RMA value of each probe set as a numerical feature.
The data set records, besides expression data, also demographic and clinical information about the subjects. Table 1 lists the recorded demographic and clinical features with their distribution in the three classes. For numeric features, it lists the mean and standard deviation, and for nominal features, the class-wise count of each feature value.
We define two binary classification tasks on the data set: schizophrenia versus control and bipolar versus control. For each task, we compare the following six classification techniques: support vector machines, nearest shrunken centroids, decision trees, ensemble of voters, naïve Bayes, and nearest neighbor. We briefly describe each technique. The section "Methods" lists the software packages that we use and explains how the parameters of the different algorithms are set.
Support vector machines
Support vector machines (SVMs, ) belong to the family of generalized linear models. We employ linear SVMs, which exhibit good classification performance on gene expression data . A linear SVM is essentially an (n-1)-dimensional hyper-plane that separates the instances of the two classes in the n-dimensional feature space. Figure 1a illustrates this for the two dimensional case: the hyper-plane reduces here to a line, which separates the empty (class 1) and filled (class 2) circles. The hyper-plane maximizes the margin with the closest training instances. These instances are called the "support vectors" because they fix the position and orientation of the hyper-plane. Linear SVMs assume that the training data is linearly separable. If this is not the case, then SVMs rely instead on the concept of a soft margin . In the evaluation, we use a soft margin SVM, which minimizes, in addition to the margin, also the sum of the distances to the training instances that are incorrectly classified by the hyper-plane (the d i in Figure 1a).
Nearest shrunken centroids
Nearest shrunken centroids (NSC, ) is a technique designed for classifying gene expression data. NSC represents each class by its centroid (mean feature vector) and classifies new instances by assigning them the class of the closest centroid. NSC shrinks the class centroids (c i in Figure 1b) in the direction of the overall data centroid. This has the effect that components of a class centroid that after shrinkage are equal to the corresponding components of the overall centroid become irrelevant to the classification process. This occurs for the horizontal component of the class centroids in Figure 1b. As a result, NSC implicitly performs a kind of feature selection.
Decision trees (DTs, ) are tree-shaped symbolic models with tests on the feature values in the internal nodes, and class labels in the leaves (Figure 1c). DTs classify a new instance by sorting it down the tree, according to the tests in the nodes, until it reaches a leaf; the label of the leaf becomes the predicted class of the new instance. C4.5  is a well-known algorithm for constructing decision trees. C4.5 builds a DT top-down, by recursively partitioning the data at each step by a test comparing a feature to a value. At each node, the algorithm selects the test that maximizes a heuristic function called information gain ratio. The better a test is able to separate the instances of the two classes, the higher its information gain ratio. Then, it partitions the training instances based on the selected test, and finally it recursively repeats the same procedure to construct a sub-tree for each subset in the partition. C4.5 creates a leaf if all remaining instances belong to the same class or if there are fewer instances than a user defined threshold. The label of the leaf is the majority class of the instances it covers. After building the tree, C4.5 prunes back some parts to reduce the expected error on new instances.
Ensemble of voters
Ensemble of voters (EOV, ) is a simple ensemble method. An EOV model is a set of decision stumps. Decision stumps are decision trees that consist of precisely one test node with two leaves. The EOV model includes one decision stump for each of the top N feature value tests ranked by the information gain score . To obtain a prediction for a new instance, the model combines the predictions of the stumps by means of majority voting: the predicted class is the class predicted by more than N/2 stumps.
Naïve Bayes (NB, ) is a statistical classifier based on Bayes rule. Its name comes from the strong (naïve) statistical independence assumptions that it makes. In spite of these strong assumptions, it often works remarkably well in practice. NB predicts a class with the rule: . It estimates P(c) and P(feature i = value i |c) from the training data. Note that NB assumes nominal features, which means that numerical features must be discretized prior to running NB.
k-nearest neighbor (k NN, ) classifies a new instance as the majority class of its k closest training instances in the feature space. For example, 3NN in Figure 1d assigns the class "black" to the new instance (indicated with a triangle).
We evaluate the performance of the different classification techniques by means of Receiver Operating Characteristic (ROC) curves. ROC analysis allows us to simultaneously compare classifiers for different misclassification costs and class distributions . It is based on the notions of "true positive rate" (TP, also known as sensitivity or recall) and "false positive rate" (FP, also known as 1.0 – specificity). Given two classes "positive" and "negative", TP rate is the proportion of correctly predicted positive examples, and FP rate is the proportion of negative examples that are incorrectly predicted positive. The vertical axis of a ROC diagram represents TP rate, and the horizontal axis FP rate. Each classifier corresponds to a point on this diagram. The closer the point is to the upper-left corner (TP rate = 1, FP rate = 0), the better the classifier.
Most classifiers provide confidence scores for their predictions. For such classifiers, a ROC curve can be constructed. We present such a curve for each classifier and report the corresponding "area under curve" (AUC), which is defined as the area between the ROC curve and the horizontal axis. To obtain a measure for the predictive performance of the models, we estimate the ROC curves using a 10-fold cross validation procedure. Details about this evaluation procedure can be found in the section "Methods". We use a t-test to assess if the AUC difference between two classifiers is significant and report the corresponding p-value.
Results and discussion
We compare the six classification techniques in the context of two classification tasks: schizophrenia versus control and bipolar versus control. In a first set of experiments, the data consist of only the gene expression features (Figures 2, 3, 4, 5, 6, 7), and in a second batch, the data include both demographic and clinical features as well as gene expression (Figures 8, 9, 10, 11, 12, 13).
Figure 2 compares the classification techniques for the schizophrenia versus control task, using only the gene expression data. SVM outperforms the other techniques. It yields a cross validated AUC of 0.91, which is significantly better than NSC (AUC = 0.71, p = 0.002), DT (AUC = 0.64, p = 0.0001), EOV (AUC = 0.71, p = 0.0001), NB (AUC = 0.71, p = 0.0004), and 3NN (AUC = 0.70, p = 0.0002). The same holds for the bipolar versus control task (Figure 5). SVM (AUC = 0.92) outperforms the other techniques. The second best technique is NSC (AUC = 0.73, p = 0.01).
We also present experiments on data for male subjects only (Figure 3 and Figure 6) and for female subjects only (Figure 4 and Figure 7), to assess if the diseases can be better predicted if data of only one sex is used. The SVM result AUC = 0.91 on the combined data of Figure 2 for schizophrenia versus control is, however, not significantly different from the result for male subjects (Figure 3, AUC = 0.92, p = 0.9) or that for female subjects (Figure 4, AUC = 0.87, p = 0.4). The same holds for the bipolar versus control task. We hypothesize that this is because the data sets with subjects of one sex only are much smaller than the combined data, so that there is less training data for each model. Even if classification is easier for such data, this is offset by the smaller data size.
Figures 8, 9, 10, 11, 12, 13 present a similar set of experiments for the data set that includes demographic and clinical data in addition to the gene expression data. Adding demographic and clinical information improves classification performance. SVM, for example, performs better on the schizophrenia versus control task with demographic information (Figure 8, AUC = 0.94) than without such additional information (Figure 2, AUC = 0.91, p = 0.06). The same holds for the other classification techniques and for the bipolar versus control task (Figure 11 versus Figure 5).
We again present experiments for male and female subjects separately (Figures 9, 10, 12 and 13), this time with the demographic and clinical data included. The conclusion from this set of experiments is similar to the previous conclusion: separating the subjects by sex does not significantly improve the classification performance for SVMs.
The superior performance of SVMs when compared to the other classification algorithms can be understood based on the properties of gene expression data. Gene expression data is typically characterized by a high dimension combined with a relatively low number of samples. For example, the present data set records the expression level of 22,283 probe sets for a number of samples that is two orders of magnitude smaller. Many classification algorithms are known to perform poorly on such high dimensional data. SVMs, on the other hand, are well suited to this setting because their classification performance can be independent of the dimensionality of the feature set : their performance rather depends on the margin with which they separate the samples (Figure 1a). This explains the good performance of SVMs on high dimensional data. Additional empirical evidence is that SVMs are known to perform well on text classification problems (where each word in the vocabulary represents a dimension) . Previous studies on gene expression data also illustrate the good performance of SVMs [5, 9].
Note that the above discussion does not imply that SVMs will always outperform other algorithms on gene expression data. For example, NSC, which implicitly performs dimensionality reduction (recall that it shrinks the class centroids towards the overall data centroid), has also been shown to work well on gene expression data [7, 9]. Therefore, it is common practice in machine learning to evaluate different classification algorithms on a new data set and based on this evaluation select the one that works best. This is also the approach that we follow in this work.
Most relevant features
To asses which features are most relevant to each of the classification tasks, we apply two techniques: (a) ranking the features by their p-value, and (b) ranking the features by their SVM weight. The first technique performs, for each feature, a two-sided t-test comparing the feature's values in the two classes. It then ranks the features by their t-test's p-value. Besides the p-values, we also report q-values . q-values measure significance in terms of the false discovery rate. For example, if all features with a q-value ≤ 5% are called significant, then 5% of these features may be false discoveries, that is, their mean value in the two classes may be actually identical. We use the software QVALUE developed by Storey  to compute the q-values. The second technique ranks the features by the weight that the SVM classifier assigns to each feature in the linear equation of its classification hyper-plane. The larger the absolute value of the SVM weight, the more important the feature is to the classification task.
The QVALUE software computes, in addition to the q-values, also an estimate of the proportion π0 of truly null features. For each of the schizophrenia versus control tasks it estimates π0 to be 1.0, that is, no significant features; for the bipolar versus control tasks, π0 ranges from 0.54 to 0.72. Note that the estimate for schizophrenia versus control is conservative (an overestimate). QVALUE makes certain assumptions about the p-value distribution of the data, which do not hold in this case (cf. Figure 14). It is interesting that, even though QVALUE estimates that there are no significant individual features, it is still possible to build classification models that are highly accurate on previously unseen data. (Recall that SVM yields a cross-validated AUC of 0.91 for schizophrenia versus control.) This is partly because the QVALUE estimate is conservative. But it also is partly because classification techniques do not rely on a single feature, but exploit the combined effect of the set of most relevant features. Therefore, obtaining an accurate classifier is possible even if there are no individual significant features.
Table 2 (schizophrenia versus control) and Table 3 (bipolar versus control) rank the features by p-value. The left panel of each table shows results based on expression data only; the right panel presents results that include the demographic and clinical features as well. Each table consists of three parts: the top part contains the rankings for all subjects combined, the middle part the male subjects' rankings, and the bottom part the female subjects' rankings.(see additional file 3)
Comparing (a) the rankings for expression data only (the left panels of the tables) to (b) the rankings for expression and demographic data (the right panels) shows that similar features appear in (a) and (b). For example, all probe sets that appear in (b) also appear in (a) for Table 2, all subjects. In addition, (b) also includes a number of highly ranked demographic and clinical features. Table 2 shows, for example, that drug use and alcohol use are ranked high for the all and male subjects cases. This indicates that some of the demographic and clinical features are important to the classification tasks. Note that we also observed this while comparing classification models: the models with demographic and clinical features are more accurate.
When comparing the features that appear in the different tables, we observe that for the schizophrenia versus control task (Table 2, expression data), the rankings for all subjects and male subjects have 14 features in common: LBH [GenBank:NM_030915], MT1X [GenBank:NM_002450], MT1X [GenBank:NM_005952], TNFSF10 [GenBank:NM_003810], ABCG2 [GenBank:AF098951], MT1E [GenBank:BF217861], SST [GenBank:NM_001048], CRHBP [GenBank:NM_001882], EMX2 [GenBank:AI478455], NPY [GenBank:NM_000905], MT2A [GenBank:NM_005953], MT1H/P2 [GenBank:NM_005951], SOX9 [GenBank:NM_000346], S100A8 [GenBank:NM_002964].
On the other hand, the rankings for all subjects and female subjects have only one feature (FAM107A [GenBank:NM_007177]) in common. For the bipolar versus control task (Table 3, expression data) all subjects and male subjects share 6 features (SST [GenBank:NM_001048], TNFSF10 [GenBank:NM_003810], NEUROD6 [GenBank:NM_022728], CD74 [GenBank:K01144], NPY [GenBank:NM_000905], TYROBP [GenBank:NM_003332]), and all subjects and female subjects also share 6 features (PPID [GenBank:NM_005038], WTAP [GenBank:BC000383], RND3 [GenBank:BG054844], DNAJA1 [GenBank:NM_001539], KIF2A [GenBank:NM_004520]). For both diseases, there is no overlap between the ranking for the female subjects and that for the male subjects. Possibly of higher interest are the features relevant to both the schizophrenia versus control and bipolar versus control tasks. Comparing the rankings (the top left rankings of Table 2 and Table 3) shows that there are 4 common features: LBH [GenBank:NM_030915], TNFSF10 [GenBank:NM_003810], SST [GenBank:NM_001048], and NPY [GenBank:NM_000905]. These are relevant to both diseases.
Table 4 and Table 5 show rankings based on SVM weights. Also here the relevant features observed for expression and demographic data are similar to those found for expression data only. There is also overlap between the rankings for the different subject subsets (all, male only, and female only). Note, however, that the features identified with the SVM weights are different from those identified with the p-value method. Consider the expression data only, all subjects rankings. For schizophrenia versus control, there are no common features in the rankings produced by the p-value method (Table 2) and the SVM method (Table 4). For bipolar versus control (Tables 3 and 5), there are two shared features: SST [GenBank:NM_001048] and LBH [GenBank:NM_030915]. This difference in rankings arises because the methods essentially have a different goal: the p-value method looks for individual features that distinguish the two classes while the SVM method yields a set of features that together distinguish the classes.
Of the top 20 genes identified using the p-value ranking, 11 have been previously implicated in schizophrenia in at least one study. These genes include: NR4A3 [GenBank:U12767] [16–19], SST [GenBank:NM_001048] , NPY [GenBank:NM_000905] [21, 22], S100A8 [GenBank:NM_002964] , CRH [GenBank:NM_000756] [24, 25], GAD1 [GenBank:NM_013445] [26, 27], FOS [GenBank:BC004490] [28, 29], JUN [GenBank:BG491844] [28, 29], DNAJB1 [GenBank:BG537255] , SLC16A1 [GenBank:AL162079, GenBank:BF511091] , and EGR2 [GenBank:NM_000399] [32, 17].
Overlap with the current literature occurs for bipolar disorder as well, although overlap is not as large primarily because of the relative immaturity of the field and concomitant smaller number of literature results. Of the top 20 genes identified using SVM weight or p-value, 7 genes have been implicated previously in bipolar disorder. Interestingly, multiple probes for the same gene are in the top 20 for DUSP6 [GenBank:BC003143, GenBank:BC005047] [33, 34] and HLA-DRA [GenBank:M60333, GenBank:M60334] . Single probes previously implicated in bipolar disorder include: SST [GenBank:NM_001048] , HLA-A [GenBank:AA573862] , NPY [GenBank:NM_000905] , HLA-DRB3 [Genbank:NM_002125] , and DNAJB1 [GenBank:BG537255] .
Interestingly, most of the remaining genes in the list are known to interact with the genes that have a documented association with either bipolar disorder or schizophrenia. These interactions were determined using Ingenuity Systems software. 14 of the 20 genes in the schizophrenia sample are involved in the same biological pathway (Figure 15). By combining the two networks generated by the software package via 3 overlapping genes, 19 of the 20 genes are in a single biological network. Similarly, 13 of the 20 genes are in a single pathway for bipolar disorder (Figure 16). By combining two of the 3 generated pathways through 3 overlapping genes, this biological network represents 16 of the 20 genes on the list.
One of the more remarkable features of this analysis is the difference in gene expression patterns between males and females. Much speculation has surrounded the role of gender in psychiatric disorders based on morphological and clinical comparisons between affected males and affected females. This analysis may provide further evidence to support and broaden this hypothesis. The most prevalent gender-based differences associated with mental disorders are in the structural abnormalities that have long been known in schizophrenia . These have been validated using CT and MRI scans, demonstrating differences in ventricle size in males and females with schizophrenia; specifically the left ventricles of males are known to be enlarged relative to both their healthy counterparts and affected females [39–45]. Another structure showing a difference in affected males and females is the corpus calosum [46–48]. The temporal lobe appears smaller in affected men than women . Specifically, the superior temporal gyri , the posterior superior temporal gyrus , and Herschel's gyri  have all been shown in one or more studies to be reduced in affected males when compared to their unaffected male counterparts or affected females. Volume reductions have also been observed in the amygdale-hipocampal complex . Reduced asymmetry of the planum temporale has been observed in females in both MRI and post mortem studies [53–56]. In this study we provide additional evidence to further bolster the claim of gender differences, but this new evidence is in the form of molecular differences between affected males and females in both schizophrenia and bipolar disorder. This may all provide evidence that gender may have confounded the results of past molecular analyses into these disorders.
The ranking based on the SVM weights does not produce a significant number of genes previously implicated in schizophrenia or bipolar disorder. This does not necessarily mean this measure does not provide as much biological insight as the ranking based on p-value. The smaller overlap may instead be because the SVM-based method is more different from previous studies than is the method based on p-values. Whereas previous studies sought individual genes, much as the ranking by p-value does, the ranking by SVM weight seeks genes that are predictive in the context of other genes. Therefore it seems likely that this more global look at bipolar disorder and schizophrenia is producing genes missed in previous analyses of microarray data on brain tissue.
Impact of alcohol and drug use
Consider the schizophrenia versus control classification task. Feature rankings by p-value, such as Table 2, may include alcohol use (AU) and drug use (DU) associated genes, and some of these may not be associated to schizophrenia. AU and DU are known to alter gene expression, that is, there are genes that are differently expressed in heavy AU/DU subjects and in low AU/DU subjects. Such AU/DU associated genes will also be differently expressed in the schizophrenia and control classes, simply because there are more high AU/DU subjects in the schizophrenia class, and more low AU/DU subjects in the control class (Table 1). Therefore, AU/DU associated genes may appear in the feature rankings (Table 2).
Identical reasoning applies to the bipolar disorder versus control task, which exhibits a similar difference in AU/DU distribution between the two classes. This should be kept in mind when analyzing the rankings. Note that these differences in distribution are already present in the population and therefore difficult to avoid in the samples.
Post-stratification can potentially be used to remove the confounding effects of variables such as AU and DU. Essentially, post-stratification computes a subset of the data such that the subset's AU/DU distribution is identical in each class. We have applied post-stratification. A detailed description of the method that we used together with its results is available in additional files 1 and 2. In these results however, post-stratification proved ineffective because it significantly reduced the amount of data, and therefore also the power of the statistical methods, resulting in unacceptably high false discovery rates. We briefly quantify this in the following paragraph.
Consider the p-value ranking for the schizophrenia versus control task, expression data only, all subjects (Table 2). Reporting all top 20 features as being significant results in a false discovery rate of 3.8%, that is, fewer than one feature is a false discovery. However, if we compute a similar ranking on the stratified data, which contains only 121 of the 332 samples in the original data , then reporting only the four top-ranked features as significant already yields a false discovery rate of 62.8%, that is, more than two of these four may be false discoveries. Because of this high false discovery rate, we decided not to use stratification in the paper and to accept the possibility of AU/DU associated genes in the rankings. Further analysis, possibly using more data, is required to identify such genes.
Note that this problem is partially mitigated by use of the SVM-based ranking instead, when using demographic features in addition to gene expression. If a gene's correlation with disease is only because of its correlations with AU/DU, then the SVM will prefer to place most/all of the weight on AU and DU rather than on this gene. The gene will receive high weight only if it provides additional predictive ability for disease beyond its association with AU/DU. As a result, the ranking will mostly include genes that are truly associated to the disease, which may explain the difference between the SVM and p-value rankings. The extent of this mitigation requires further study to quantify, which is difficult because we do not have a ground truth to compare to, that is, it would require that we know which genes are directly associated with AU/DU and are not associated with the diseases.
This paper demonstrates that both bipolar disorder and schizophrenia induce substantial changes in gene expression within the brain – substantial enough that each can be distinguished from normal control with an area under the ROC curve of over 0.9. The paper also demonstrates the utility of combining gene expression and clinical data. To our knowledge, this is the first time such a combination has been employed on this scale. Finally, the paper demonstrates the significant advantage of support vector machines for this task over other widely used algorithms from statistical classification and machine learning.
Using these classification schemes we have shown an overlap with the current literature when ranking the genes according to p-value. In fact, nearly the entire schizophrenia and the entire bipolar disorder list are either indicated in the literature or are involved in a biological network with a previously implicated gene. However, when ranking the genes based on SVM weight only 1 or 2 genes out of 20 on each list overlapped with the current literature. This does not necessarily imply that these methods are not viable but rather that these may be previously unidentified candidates that have risen to the top due to the large sample size of this analysis and the application of alternative classification algorithms for microarray data analysis.
This paper also discussed the possible impact of variables such as alcohol and drug use on the presented gene rankings. Post-stratification can correct for such variables, but it significantly reduces the power of the statistical methods. Therefore, data for more controls with high values for these variables is required so that the AU/DU distributions in the different classes become more similar. It should also be kept in mind that this is a retrospective study on a given data set and that all results will require further clinical validation. Samples in the collection were matched during the collection process for as many parameters as possible. While there may be some confounding effects from pre-mortem consumption of alcohol and other non-prescription medication, this analysis is our best attempt to account for differences in the factors through analytical means (Torrey, Webster, et al 2000). Since, these samples are not derived from controlled animals models we will have to rely on these analytical means to aid our efforts in dissecting the root causes of complex diseases.
Encoding of nominal features
Because most classification techniques that we consider are restricted to numerical features only, we re-encode each nominal feature as a numeric feature. Table 1 indicates the encoding after each feature value. For the binary features "Sex", "Left brain", "Brain region", and "Smoking at time of death", we use an encoding with three values: 1, 0, and -1, with 0 indicating "unknown". The features that have essentially ordered values, such as "Alcohol use", "Drug use" and "Rate of death", we encode with a simple integer encoding.
Software packages and versions
We choose the SVM-light software version 6.01  for constructing linear soft margin SVMs. We use the NSC software PAM version 1.28, which is available at: http://www-stat.stanford.edu/~tibs/PAM/. We use the reimplementation of C4.5 that is available in the Weka data mining tool version 3.4.4 . We also use Weka 3.4.4's implementations of NB and k NN. We use the EOV software version 1.0 by Hardin et al. . To compute the q-values, we use the software QVALUE version 1.1 developed by Storey . All are freely available.
ROC curves and AUC
Most classifiers provide confidence scores for their predictions. The classification behavior of such classifiers can be modified by applying a threshold to this score: only predict positive if the confidence is above the given threshold. By varying the threshold, we obtain different ROC points, which can be connected into a curve. (The curve can then be used, for example, to select an appropriate threshold.) We present such a curve for each classifier and report the corresponding AUC.
To obtain the ROC curves, we use 10-fold cross-validation (CV). 10-fold CV is often used to evaluate the predictive performance of classifiers if the number of instances is small. In this situation, 10-fold CV results in a lower-variance estimate of error than does the use of a single held-aside test set. 10-fold CV consists of three steps: (a) partition the data set D into 10 subsets T i ; (b) train 10 classifiers on the training sets D-T i ; (c) test classifier i on test set T i . We pool the predicted confidence values of the classifiers over the 10 test sets to construct the ROC curve. The CV algorithm that we employ is stratified, which means that it ensures that the T i have identical class distributions, or as nearly identical as possible.
To assess statistical significance when comparing two classification techniques by AUC value, we use a two-sided paired t-test. The paired sample values used in the test are the AUC values computed for the two techniques on the 10 CV test sets.
Most classification techniques come with a number of parameters. We set all parameters to their default values, except for the following. The parameter C of SVM-light, which controls the contribution of the misclassified examples, is set to 1.0 (SVM-light is not particularly sensitive to C's value). We enable Laplace smoothing of DT's confidence values. Following Hardin et al. , we set EOV's number of decision stumps N to 20. We enable Weka's discretization feature for NB. We run k NN with k = 3 neighbors.
We tune NSC's Δ parameter, which controls the amount of shrinkage, by means of 10-fold CV, as suggested by Tibshirani et al. . Recall that we also create ROC curves by means of CV. To avoid overfitting the test set, we repeat the parameter tuning each time we run NSC, that is, for each ROC CV fold. The tuning selects the value of Δ that maximizes TP + TN, with TP the true positive rate and TN the proportion of negative examples that are correctly classified.
The performance of some classification techniques can be improved by running a feature selection method prior to constructing the classifier. We implement feature selection for SVM, NB, and 3NN. The feature selection works as follows. We perform a two-sided paired t-test for each feature comparing its value in the two classes. Then we rank the features by their t-test's p-value and retain the 10% features that most significantly differ in the two classes (similar to the p-value ranking discussed before). We repeat the feature selection for each CV fold and perform the t-test on the corresponding training set.
To compute the feature ranking by SVM weight, we normalize each feature by subtracting its mean and dividing by its standard deviation (in the data set at hand), and run the SVM algorithm on the transformed data. The rationale behind this is to avoid favoring features with a small value range in the ranking. We also enable feature selection to construct the ranking by SVM weight as discussed before.
Description of additional data files
• FeatureRankingsDetailed.xls (Microsoft Excel format): "Feature rankings with additional information" in additional file 3.
The Stanley Neuropathology Consortium. [http://www.stanleygenomics.org]
Torrey EF, Webster MJ, Knable MB, Johnston N, Yolken RH: The Stanley Foundation brain collection and Neuropathology Consortium. Schizophr Res. 2000, 44: 151-155.
Wu Z, Irizarry RA: Stochastic models inspired by hybridization theory for short oligonucleotide arrays. J Comput Biol. 2005, 12: 882-893.
Vapnik V: Statistical Learning Theory. Wiley Series on Adaptive and Learning Systems for Signal Processing, Communications, and Control. 1998, New York, NY: John Wiley & Sons
Furey TS, Cristianini N, Duffy N, Bednarski DW, Schummer M, Haussler D: Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics. 2000, 16: 906-914.
Cortes C, Vapnik V: Support-vector networks. Mach Learn. 1995, 20: 273-297.
Tibshirani R, Hastie T, Narasimhan B, Chu G: Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc Natl Acad Sci U S A. 2002, 99 (10): 6567-6572.
Quinlan JR: C4.5: Programs for Machine Learning. Morgan Kaufmann Series in Machine Learning. 1993, New York, NY: John Wiley & Sons
Hardin J, Waddell M, Page CD, Zhan F, Barlogie B, Shaughnessy J, Crowley JJ: Evaluation of multiple models to distinguish closely related forms of disease using DNA microarray data: an application to multiple myeloma. Stat Appl Genet Mol Biol. 2004, 3 (Article10):
Domingos P, Pazzani MJ: On the optimality of the simple Bayesian classifier under zero-one loss. Mach Learn. 1997, 29: 103-130.
Aha DW, Kibler DF, Albert MK: Instance-based learning algorithms. Mach Learn. 1991, 6: 37-66.
Provost FJ, Fawcett T: Analysis and visualization of classifier performance: comparison under imprecise class and cost distributions. Proceedings of the Third International Conference on Knowledge Discovery and Data Mining: 14–17 August 1997; Newport Beach. Edited by: Heckerman D, Mannila H, Pregibon D. 1997, Menlo Park, CA: AAAI Press, 43-48.
Joachims T: Text categorization with suport vector machines: learning with many relevant features. Lecture Notes in Computer Science. 1998, 1398: 137-142.
Storey JD, Tibshirani R: Statistical significance for genomewide studies. P Natl Acad Sci USA. 2003, 100: 9440-9445.
Storey JD: A direct approach to false discovery rates. J Roy Stat Soc B. 2002, 64: 479-498.
Laflamme C, Filion C, Labelle Y: Functional characterization of SIX3 homeodomain mutations in holoprosencephaly: interaction with the nuclear receptor NR4A3/NOR1. Hum Mutat. 2004, 24: 502-508.
Nichols CD, Sanders-Bush E: A single dose of lysergic acid diethylamide influences gene expression patterns within the mammalian brain. Neuropsychopharmacology. 2002, 26 (5): 634-642.
Wansa KD, Muscat GE: TRAP220 is modulated by the antineoplastic agent 6-Mercaptopurine, and mediates the activation of the NR4A subgroup of nuclear receptors. J Mol Endocrinol. 2005, 34: 835-848.
Werme M, Ringholm A, Olson L, Brene S: Differential patterns of induction of NGFI-B, Nor1 and c-fos mRNAs in striatal subregions by haloperidol and clozapine. Brain Res. 2000, 863: 112-119.
Nakatani N, Hattori E, Ohnishi T, Dean B, Iwayama Y, Matsumoto I, Kato T, Osumi N, Higuchi T, Niwa S, Yoshikawa T: Genome-wide expression analysis detects eight genes with robust alterations specific to bipolar I disorder: relevance to neuronal network perturbation. Hum Mol Genet. 2006, 15: 1949-1962.
Itokawa M, Arai M, Kato S, Ogata Y, Furukawa A, Haga S, Ujike H, Sora I, Ikeda K, Yoshikawa T: Association between a novel polymorphism in the promoter region of the neuropeptide Y gene and schizophrenia in humans. Neurosci Lett. 2003, 347: 202-204.
Hashimoto T, Arion D, Unger T, Maldonado-Avilés JG, Morris HM, Volk DW, Mirnics K, Lewis DA: Alterations in GABA-related transcriptome in the dorsolateral prefrontal cortex of subjects with schizophrenia. Mol Psychiatry. 2008, 13 (2): 147-161.
Roche S, Cassidy F, Zhao C, Badger J, Claffey E, Mooney L, Delaney C, Dobrin S, McKeon P: Candidate gene analysis of 21q22: support for S100B as a susceptibility gene for bipolar affective disorder with psychosis. Am J Med Genet B Neuropsychiatr Genet. 2007, 144B (8): 1094-1096.
Holsboer F: The role of peptides in treatment of psychiatric disorders. J Neural Transm-Supp. 2003, 64: 17-34.
De Wied D, Sigling HO: Neuropeptides involved in the pathophysiology of schizophrenia and major depression. Neurotox Res. 2002, 4: 453-468.
Zhao X, Qin S, Shi Y, Zhang A, Zhang J, Bian L, Wan C, Feng G, Gu N, Zhang G, He G, He L: Systematic study of association of four GABAergic genes: glutamic acid decarboxylase 1 gene, glutamic acid decarboxylase 2 gene, GABA(B) receptor 1 gene and GABA(A) receptor subunit beta2 gene, with schizophrenia using a universal DNA microarray. Schizophr Res. 2007, 93: 374-384.
Akbarian S, Huang HS: Molecular and cellular mechanisms of altered GAD1/GAD67 expression in schizophrenia and related disorders. Brain Res Rev. 2006, 52: 293-304.
Spiliotaki M, Salpeas V, Malitas P, Alevizos V, Moutsatsou P: Altered glucocorticoid receptor signaling cascade in lymphocytes of bipolar disorder patients. Psychoneuroendocrinology. 2006, 31 (6): 748-760.
Brunello N, Tascedda F: Cellular mechanisms and second messengers: relevance to the psychopharmacology of bipolar disorders. Int J Neuropsychopharmacol. 2003, 6 (2): 181-189.
Iwamoto K, Kakiuchi C, Bundo M, Ikeda K, Kato T: Molecular characterization of bipolar disorder by comparing gene expression profiles of postmortem brains of major mental disorders. Mol Psychiatry. 2004, 9 (4): 406-416.
Thwaites DT, Anderson CM: H+-coupled nutrient, micronutrient and drug transporters in the mammalian small intestine. Exp Physiol. 2007, 92: 603-619.
Yamada K, Gerber DJ, Iwayama Y, Ohnishi T, Ohba H, Toyota T, Aruga J, Minabe Y, Tonegawa S, Yoshikawa T: Genetic analysis of the calcineurin pathway identifies members of the EGR gene family, specifically EGR3, as potential susceptibility candidates in schizophrenia. Proc Natl Acad Sci U S A. 2007, 104 (8): 2815-2820.
Lee KY, Ahn YM, Joo EJ, Chang JS, Kim YS: The association of DUSP6 gene with schizophrenia and bipolar disorder: its possible role in the development of bipolar disorder. Mol Psychiatry. 2006, 11 (5): 425-426.
Carter CJ: Multiple genes and factors associated with bipolar disorder converge on growth factor and stress activated kinase pathways controlling translation initiation: implications for oligodendrocyte viability. Neurochem Int. 2007, 50: 461-490.
Kang BJ, Park SW, Chung TH: Can the expression of histocompatibility antigen be changed by lithium?. Bipolar Disord. 2000, 2: 140-144.
Kuromitsu J, Yokoi A, Kawai T, Nagasu T, Aizawa T, Haga S, Ikeda K: Reduced neuropeptide Y mRNA levels in the frontal cortex of people with schizophrenia and bipolar disorder. Brain Res Gene Expr Patterns. 2001, 1: 17-21.
Yu YQ, Yu Q, Guo YJ, Sang H, Shi JP, Liu SZ, Wei J: Study on the genetic association between DRB3 and DRB1 loci in the human MHC region and psychotic symptoms of schizophrenia. Zhonghua Liuxingbingxue Zazhi. 2003, 24: 815-818.
Lencz T, Cornblatt B, Bilder RM: Neurodevelopmental models of schizophrenia: pathophysiologic synthesis and directions for intervention research. Psychopharmacol Bull. 2001, 35: 95-125.
Flaum M, O'Leary DS, Swayze VW, Miller DD, Arndt S, Andreasen NC: Symptom dimensions and brain morphology in schizophrenia and related psychotic disorders. J Psychiatr Res. 1995, 29: 261-276.
Buchanan RW, Vladar K, Barta PE, Pearlson GD: Structural evaluation of the prefrontal cortex in schizophrenia. Am J Psychiatry. 1998, 155: 1049-1055.
Haas GL, Garratt LS, Sweeney JA: Delay to first antipsychotic medication in schizophrenia: impact on symptomatology and clinical course of illness. J Psychiatr Res. 1998, 32: 151-159.
Haas HL, Selbach O: Functions of neuronal adenosine receptors. Naunyn Schmiedebergs Arch Pharmacol. 2000, 362: 375-381.
Harvey PD: Cognitive impairment in elderly patients with schizophrenia: age related changes. Int J Geriatr Psychiatry. 2001, 16 Suppl 1: S78-S85.
Ho BC, Andreasen NC, Nopoulos P, Arndt S, Magnotta V, Flaum M: Progressive structural brain abnormalities and their relationship to clinical outcome: a longitudinal magnetic resonance imaging study early in schizophrenia. Arch Gen Psychiatry. 2003, 60: 585-594.
Nopoulos P, Flaum M, Andreasen NC: Sex differences in brain morphology in schizophrenia. Am J Psychiatry. 1997, 154: 1648-1654.
Innocenti GM, Ansermet F, Parnas J: Schizophrenia, neurodevelopment and corpus callosum. Mol Psychiatry. 2003, 8: 261-274.
Raine A, Benishay D, Lencz T, Scarpa A: Abnormal orienting in schizotypal personality disorder. Schizophr Bull. 1997, 23: 75-82.
Nasrallah HA, Sharma S, Olson SC: The volume of the entorhinal cortex in schizophrenia: a controlled MRI study. Prog Neuropsychopharmacol Biol Psychiatry. 1997, 21: 1317-1322.
Bogerts B, Ashtari M, Degreef G, Alvir JM, Bilder RM, Lieberman JA: Reduced temporal limbic structure volumes on magnetic resonance images in first episode schizophrenia. Psychiatry Res. 1990, 35: 1-13.
Reite M, Sheeder J, Teale P, Adams M, Richardson D, Simon J, Jones RH, Rojas DC: Magnetic source imaging evidence of sex differences in cerebral lateralization in schizophrenia. Arch Gen Psychiatry. 1997, 54: 433-440.
Rojas DC, Teale P, Sheeder J, Simon J, Reite M: Sex-specific expression of Heschl's gyrus functional and structural abnormalities in paranoid schizophrenia. Am J Psychiatry. 1997, 154: 1655-1662.
Yamasue H, Fukui T, Fukuda R, Yamada H, Yamasaki S, Kuroki N, Abe O, Kasai K, Tsujii K, Iwanami A, Aoki S, Ohtomo K, Kato N, Kato T: 1H-MR spectroscopy and gray matter volume of the anterior cingulate cortex in schizophrenia. Neuroreport. 2002, 13: 2133-2137.
Hoff AL, Wieneke M, Faustman WO, Horon R, Sakuma M, Blankfeld H, Espinoza S, DeLisi LE: Sex differences in neuropsychological functioning of first-episode and chronically ill schizophrenic patients. Am J Psychiatry. 1998, 155: 1437-1439.
DeLisi LE: Structural brain changes in schizophrenia. Arch Gen Psychiatry. 1999, 56: 195-196.
Falkai P, Schneider-Axmann T, Honer WG: Entorhinal cortex pre-alpha cell clusters in schizophrenia: quantitative evidence of a developmental abnormality. Biol Psychiatry. 2000, 47: 937-943.
Frangou S, Sharma T, Alarcon G, Sigmudsson T, Takei N, Binnie C, Murray RM: The Maudsley Family Study, II: endogenous event-related potentials in familial schizophrenia. Schizophr Res. 1997, 23: 45-53.
Joachims T: Making large-scale SVM learning practical. Advances in Kernel Methods – Support Vector Learning. Edited by: Schölkopf B, Burges C, Smola A. 1999, Cambridge, MA: MIT Press, 169-184.
Witten IH, Frank E: Data Mining: Practical machine learning tools and techniques. 2005, San Mateo, CA: Morgan Kaufmann, 2
Storey JD: Non-uniform null p-values and strange looking p-values. 2007, The QVALUE software Google group, [http://groups.google.com/group/qvalue]
Research supported by the US NSF grant IIS 0534908 and the US NIH grant R01 CA127379. Postmortem brain tissue was donated by The Stanley Medical Research Institute's brain collection. JS is a post-doctoral fellow of the Research Foundation – Flanders (FWO-Vlaanderen). DP is supported by the US NSF grant IIS 0534908.
JS performed the statistical analysis and drafted the manuscript. SD provided the data, provided critical input and drafted the biological relevance section of the manuscript. DP supervised the study and helped drafting the manuscript. All authors read and approved the final manuscript.
Authors’ original submitted files for images
Below are the links to the authors’ original submitted files for images.