Genetical genomics: use all data
 Miguel Pérez-Enciso^{1, 2},
 José R Quevedo^{3} and
 Antonio Bahamonde^{3}
DOI: 10.1186/1471-2164-8-69
© Pérez-Enciso et al.; licensee BioMed Central Ltd. 2007
Received: 14 November 2006
Accepted: 12 March 2007
Published: 12 March 2007
Abstract
Background
Genetical genomics is a very powerful tool to elucidate the basis of complex traits and disease susceptibility. Despite its relevance, however, statistical modeling of expression quantitative trait loci (eQTL) has not received the attention it deserves. Based on two reasonable assertions (i) a good model should consider all available variables as potential effects, and (ii) gene expressions are highly interconnected, we suggest that an eQTL model should consider the rest of expression levels as potential regressors, in addition to the markers.
Results
It is shown that power can be increased with this strategy. We also show, using classical statistical and support vector machines techniques in a reanalysis of public data, that the external transcripts, i.e., transcripts other than the one being analysed, explain on average much more variability than the markers themselves. The presence of eQTL hotspots is reassessed in the light of these results.
Conclusion
Model choice is a critical yet neglected issue in genetical genomics studies. Although we are far from having a general strategy for model choice in this area, we can at least propose that any transcript level is scanned not only for the markers genotyped but also for the rest of gene expression levels. Some sort of stepwise regression strategy can be used to select the final model.
Background
Genetical genomics is currently a very active area of research, promising to improve dramatically our knowledge of the genetic architecture of complex traits, including disease susceptibility. Its goal is to identify the polymorphisms responsible for the variation in gene expression levels and thus to improve our understanding of how gene networks are organised in an organism. Thus far, genetical genomics experiments have been analysed considering each expression level one at a time and using fairly simple statistical models, correcting only for, e.g., sex. As a consequence, the results are a collection of successive quantitative trait loci analyses (eQTL in the terminology introduced by Schadt et al. [1, 2]), where each gene expression level is analysed independently. It is surprising that much effort has been dedicated to issues like data normalization [3] or computing efficiency [4], whereas modelling the trait itself (i.e., the expression level) has been severely neglected.
The proposed approach provides new insight into genetical genomics data, as we argue later, but first the researcher should be aware of the interpretation of the different strategies. In the usual modelling strategy, when only markers are fitted to explain the expression level, one is interested in picking up markers associated to the trait regardless of other effects. If markers and additional correlated expression levels are included in the model, one will select only those markers that are conditionally associated to the phenotype of interest, that is, after removing the direct effect of external expression levels on the phenotype. The rationale for this is to avoid confounding and improve power by reducing environmental noise. In practice, the issue of what covariables to include can be a difficult choice. Suppose that we are analyzing the effect of some polymorphisms on a phenotype in human populations. Suppose also that this phenotype is affected by sex. If one includes height as well as sex in the model, and because height and sex are correlated, it might be that a 'true' marker effect is attenuated. Here, fitting the marker within each sex class (i.e., fitting an interaction sex × marker) could improve model performance.
A relevant question in our approach is: what is the relative importance of each set of variables, discrete (markers) vs. continuous (transcript levels)? This question is equivalent to disentangling the relevance of genetics vs. environment, because transcript levels are part of the 'environment'. There are two broad approaches in the literature to measure the relevance of variables, based respectively on statistical and on artificial intelligence techniques. The former is based on fitting two competing models, with and without the variable of interest; a typical measure of variable importance is the P-value obtained in, say, a likelihood ratio test. The P-value essentially measures the probability of obtaining data at least as extreme as those observed when the null model is true; the smaller this probability, the less plausible the null model. In contrast to statistical methodologies, artificial intelligence techniques are not yet very popular in genetics. Among the plethora of artificial intelligence techniques, support vector machines (SVM) have emerged as one of the most reliable and efficient methodologies. SVM are a powerful family of algorithms for learning classification and regression tasks [5]. They are based on the minimization of the structural risk of errors by means of well-known and well-founded techniques of quadratic programming. Using the so-called kernel trick, SVM can learn linear and nonlinear functions from either continuous or discrete variables. An important feature of SVM is that they can successfully handle datasets with a small number of observations and thousands of independent variables, the so-called 'large p, small n paradigm', which makes them an attractive tool for microarray data [6–8] or for information retrieval, where each document is described by a vector with as many indexes as possible words [9]. Importantly, SVM can be endowed with algorithms [10, 11] that provide an ordered list of variables according to their relevance in a prediction task. However, and in contrast to classical statistical methods, the relevance of each individual variable cannot be quantified with these algorithms.
In this work we explore the consequences of model choice in eQTL studies reanalyzing public data with maximum likelihood and SVM techniques. We show that, in fact, most of the variation observed in gene expression is largely explained by external expression levels, while the influence of polymorphic markers (the eQTL itself) is limited. We show that this can have a dramatic influence on the results obtained. Figure 1 illustrates the point made in this article.
Results and discussion
Variable relevance
The opposite phenomenon was observed with the transcript level of Lin7c, one of the most highly connected genes in the brain [12]. In this case, the original eQTL had a nominal P-value ~10^{-5} with the usual model (model 1). This significance decreased when the Sacm1l expression level was included using model (2) (P = 2 × 10^{-3}). The Sacm1l transcript level was the most significant factor associated with Lin7c expression (P < 10^{-53}), i.e., much more significant than any marker. It should also be noted that the position of the maximum statistic was shifted, from marker D12Mit234 in chromosome 12 to D6Mit116 in chromosome 6. Interestingly, the expression level of Sacm1l also had an eQTL in the neighborhood of marker D12Mit234 with model 1. This likely occurs because both traits are highly correlated (ρ = 0.99). Schadt et al. [2] proposed to compare the likelihoods P(x_{1}|m) and P(x_{1}|x_{2}), where x_{1} and x_{2} are cDNA measures and m are marker genotypes, in order to disentangle whether m or x_{2} is causal to x_{1}. Here, we compared P(x_{Lin7c}|x_{Sacm1l}) vs. P(x_{Sacm1l}|x_{Lin7c}), but they were almost identical and thus we cannot resolve whether one gene is causal to the other by using only statistical evidence, whereas P(x_{Lin7c}|D12Mit234) was slightly more significant (P = 10^{-5}) than P(x_{Sacm1l}|D12Mit234) (P = 10^{-3}). All this hints that Lin7c and Sacm1l are influenced by the same causal effects, but their association with D12Mit234 could actually be a false positive, because it is far less significant than other eQTL found in this same study.
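Table 1. Associated P-values and AUC for some of the most significant QTL reported by Chesler et al. (2005)

| Gene | Best marker^{a}: name | Best marker^{a}: -log10 P-value^{d} | Best marker^{a}: AUC% | Best transcript^{b}: name | Best transcript^{b}: -log10 P-value^{d} | Best transcript^{b}: AUC% | AUC50(%)^{c}: Marker | AUC50(%)^{c}: Transcript | AUC50(%)^{c}: All |
|---|---|---|---|---|---|---|---|---|---|
| Trans-QTL | | | | | | | | | |
| Mela | D9Mit196 | 23 | 72 | Cap1 | 6 | 65 | 76 | 88 | 93 |
| Myoc | D2Mit237 | 14 | 71 | Pam | 7 | 66 | 77 | 88 | 91 |
| Cd59a | D13Mit11 | 12 | 72 | A08Rik | 11 | 57 | 74 | 89 | 89 |
| Myh9 | D19Mit35 | 5 | 61 | Igfbp5 | 18 | 91 | 71 | 89 | 91 |
| Pitpnb | S14Gnf055.010 | 6 | 63 | Rab7 | 20 | 77 | 71 | 88 | 88 |
| Cis-QTL | | | | | | | | | |
| Prdx2 | S08Gnf094.275 | 15 | 71 | Psma7 | 20 | 74 | 69 | 94 | 94 |
| Kcnj9 | D9Mit11 | 7 | 62 | Tnp1 | 24 | 77 | 62 | 95 | 96 |
| Krt112 | D11Mit58 | 13 | 70 | Mrps7 | 10 | 71 | 73 | 89 | 89 |
| Ntan1 | D12Nyu7 | 5 | 57 | K22Rik | 39 | 80 | 63 | 84 | 85 |
| Mrpl48 | D7Mit17 | 11 | 73 | Mcee | 23 | 67 | 77 | 90 | 91 |
| Largest clique | | | | | | | | | |
| Lin7c | D12Mit84 | 5 | 62 | Sacm1l | 53 | 90 | 80 | 97 | 97 |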
The above considerations should not imply that the most significant variable is always an external cDNA level for all genes, and thus that model (2) is always to be preferred over model (1). We observed that the usual model (1) was to be preferred in about 25% of the most significant QTL listed by Chesler et al. [12]. In Table 1, the most significant variable was a marker rather than a transcript level in four out of ten genes (Mela, Myoc, Cd59a, and Krt112). We investigated whether modeling can nevertheless be improved in these cases. To do that, we searched for the next most significant effect among all external cDNAs and the rest of the markers, computing its P-value after fitting the QTL. We observed that the next most significant variable was a transcript, which in turn was highly significant (P-values ranged from 10^{-8} to 10^{-30}). Note that this is surely an underestimate of the influence of transcripts, because we were considering those expression levels for which a marker (QTL) is the most significant effect. The conclusion that we can draw, again, is that external cDNAs are likely to be very important factors to be considered in genetical genomics studies.
Reanalyzing hotspots
The observation of eQTL hotspots, i.e., genome regions that seem to harbour a much higher number of QTL than expected by chance, has been widely debated in the literature [1, 13–15]. This is a remarkable observation, and it is tempting to look for a functional significance in these regions. It suggests the presence of key regulatory, polymorphic motifs in the genome that can have a profound influence on genome transcription activity. In a previous simulation study we observed that QTL hotspots appeared even when genotypes and microarrays were shuffled, simply as a consequence of the high correlation that exists between many expression levels [16]. It is important to notice, though, that we can use the correlation between cDNA levels to our advantage in order to improve eQTL modelling dramatically.
Table 2. QTL results for a subset of genes pertaining to a QTL hotspot localised around marker D6Mit254 (chr. 6).

| Gene | -log10 P-value of QTL (model 1)^{a,b} | Best transcript | -log10 P-value of best transcript^{a} | Position of QTL (model 2)^{c} | -log10 P-value of QTL (model 2)^{a,d} |
|---|---|---|---|---|---|
| Reln | 4.4 | 0610039D01Rik | 30.0 | S05Gnf018.190 | 21.0 |
| Chrng | 4.1 | Gpx1 | 32.3 | D6Mit254 | 5.6 |
| Slc6a1 | 3.7 | Mad2l1 | 35.5 | D6Mit254 | 5.9 |
| Calm4 | 3.6 | Adprhl2 | 37.8 | S17Gnf094.470 | 7.2 |
| Mapk6 | 3.5 | Tusc2 | 41.7 | D6Mit254 | 10.9 |
| Adra2b | 3.3 | Sox11 | 30.9 | S04Gnf147.400 | 3.6 |
| Mapk1 | 3.3 | 1110011K10Rik | 34.4 | D6Mit254 | 6.2 |
| Gad1 | 3.0 | Bzrp | 45.3 | D6Mit254 | 3.2 |
| Htr4 | 2.5 | Hbbb2 | 45.5 | D18Mit19 | 11.7 |
Two aspects are important. First, the QTL were more significant now than with model (1); sometimes significance increased dramatically, e.g., the P-value changed from 10^{-4} to 10^{-21} for gene Reln. This is clear evidence that expression levels included in the model remove noise and thus may increase power for QTL detection, as we observed previously (Figure 2 top). We did not always observe this; in other instances we found that adjusting for transcript levels decreased QTL significance (Figure 2 bottom). In these latter cases we can conclude that the QTL found with the simple model is an artefact and that the QTL was truly affecting the expression level of another gene. The second noticeable aspect in Table 2 is that the QTL location was shifted for some transcripts (Reln, Calm4, Adra2b, and Htr4). Although we did not systematically search for all cDNA levels affected by this QTL, it is clear that the number of genes in the hotspot has been reduced by ~50%. This is a challenging result and calls for revisiting the significance of eQTL hotspots, as they are highly dependent on the model used. From a purely statistical (rational) point of view, a model that includes a highly correlated transcript level is to be preferred over one that includes only the marker: P-values in the order of 10^{-30} to 10^{-45} vs. ~10^{-4}.
Conclusion
Including external expression levels in the model can improve statistical inference, decreasing the rate of false positives and increasing power (e.g., Figure 2). This is a consequence of the biological fact that regulation of gene expression is highly interconnected, resulting in the complex intercorrelations that exist in microarray data. Seemingly, a high fraction of this correlation is purely environmental. In fact, as Figure 3 suggests, expression levels are far more important than markers in explaining the observed variability. In other words, the variation in a given expression level is more likely to be affected by the expression levels of related genes than directly caused by marker polymorphisms. Thus, we can argue that expression levels behave as noisy environmental factors, very much like say age, sex or batch in a regular statistical analysis. But in all likelihood, each expression level will behave differently and thus we will require specific models for each expression level. Thus, automated and efficient modeling strategies are badly needed if we are to exploit all information contained in genetical genomics studies.
In conclusion, model choice is a critical yet neglected issue in genetical genomics studies. Although we are far from having a general strategy for model choice in this area, we can at least propose that any transcript level is scanned not only for the markers genotyped but also for the rest of gene expression levels. Some sort of stepwise regression strategy can be used to select the final model. This will illuminate what a QTL hotspot is really made of and will improve our ability to reconstruct genetic networks from genetical genomics experiments.
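As an illustration of the stepwise strategy suggested above, the following Python sketch performs forward selection over a pool of candidate regressors (marker codes and external transcript levels alike) using the Gaussian likelihood ratio statistic. It is a minimal sketch and not the procedure used in this work: the function names, the simulated data and the entry threshold `min_lrt` are assumptions made for illustration, and each candidate is treated as a single numeric column (e.g., a two-genotype marker coded 0/1).

```python
import math
import random


def center(v):
    """Subtract the mean, which is equivalent to fitting a general mean."""
    m = sum(v) / len(v)
    return [a - m for a in v]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def residualize(v, z):
    """Remove from v its least-squares projection on z."""
    b = dot(v, z) / dot(z, z)
    return [a - b * c for a, c in zip(v, z)]


def forward_select(y, candidates, min_lrt=30.0):
    """Add, one at a time, the candidate with the largest likelihood ratio
    statistic, until no remaining candidate reaches min_lrt (hypothetical cut-off).

    After each selection, the trait and the remaining candidates are
    residualized on the chosen column, so every statistic is conditional
    on the variables already in the model.
    """
    n = len(y)
    y_res = center(y)
    cand = {name: center(col) for name, col in candidates.items()}
    chosen = []
    while cand:
        def stat(col):
            # For one added regressor, n*log(RSS_reduced/RSS_full) = -n*log(1 - r^2)
            r2 = dot(y_res, col) ** 2 / (dot(y_res, y_res) * dot(col, col))
            return -n * math.log(1.0 - r2)

        best = max(cand, key=lambda name: stat(cand[name]))
        if stat(cand[best]) < min_lrt:
            break
        z = cand.pop(best)
        chosen.append(best)
        y_res = residualize(y_res, z)
        cand = {name: residualize(col, z) for name, col in cand.items()}
    return chosen


# Simulated example: the trait depends on candidates x1 and x3 only.
rng = random.Random(7)
n_obs = 80
pool = {f"x{k}": [rng.gauss(0, 1) for _ in range(n_obs)] for k in range(5)}
trait = [2.0 * a + 1.0 * b + rng.gauss(0, 0.3)
         for a, b in zip(pool["x1"], pool["x3"])]
selected = forward_select(trait, pool)
print(selected)
```

The threshold is deliberately strict here; in a real scan it would have to account for the number of markers and transcripts tested.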
Methods
Data
Chesler et al.'s [12] experiment consists of a set of 35 BxD mouse recombinant inbred lines. Brain tissue from 100 pools of individuals was arrayed on Affymetrix U74Av2 array chips, and a panel of 779 markers was genotyped. Each array experiment used a pool of brain tissue (excluding olfactory bulb, retina and neurohypophysis) from three individuals of the same sex. The data set was downloaded from the GeneNetwork site [17].
Maximum likelihood techniques
Two main models were used to reanalyze the data. In model (1), the jth expression level for the ith individual (i = 1, ..., n) is modelled as
$y_{ij} = \mu + \sum_{t} \lambda_{it} g_{t} + \epsilon_{ij} \qquad (1)$
where μ is the general mean or any other fixed effects that may be included in the model, λ is a 0/1 indicator variable that identifies the genotype of the individual for the marker considered (i.e., λ_{it} = 1 if the ith individual has genotype t, 0 otherwise), g_{t} is the effect of genotype t, and ε is the residual. Note that we assume that marker density is very high, so that we test each individual marker in turn instead of carrying out a QTL scan. This is done here for simplicity and computational speed, as it is straightforward to generalize (1) to other situations. Carlborg et al. [18] found no large differences between single marker and interval mapping in this context. In model (2) we allow, in addition, cDNAs other than the one analysed to be included in the model as covariates, i.e.,
$y_{ij} = \mu + \sum_{t} \lambda_{it} g_{t} + \sum_{k \ne j} \delta_{k} \beta_{k} y_{ik} + \epsilon_{ij} \qquad (2)$
where δ_{k} is an indicator variable with value 1 if the cDNA level k is included in the model with covariate coefficient β_{k}, 0 otherwise. Note that choosing an adequate set of δ's can be a very difficult problem. For this work, we scanned all cDNAs and included in (2) only the most significant cDNA. Models (2) and (1) were compared with a likelihood ratio test, and P-values were computed assuming the usual chi-squared approximation. Likelihood was maximized using an EM algorithm implemented in the package Qxpak [19].
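The comparison between models (1) and (2) can be sketched in a few lines of Python. The example below is only an illustration under simplifying assumptions and is not the Qxpak implementation: it uses ordinary least squares on simulated data, for which the Gaussian likelihood ratio statistic reduces to n log(RSS_reduced / RSS_full); it codes a two-genotype marker as a single 0/1 column; and all function names and simulated effect sizes are hypothetical.

```python
import math
import random


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def rss(design, y):
    """Residual sum of squares of the least-squares fit of y on the design rows."""
    p = len(design[0])
    xtx = [[sum(row[a] * row[b] for row in design) for b in range(p)]
           for a in range(p)]
    xty = [sum(row[a] * yi for row, yi in zip(design, y)) for a in range(p)]
    beta = solve(xtx, xty)
    return sum((yi - sum(bk * xk for bk, xk in zip(beta, row))) ** 2
               for row, yi in zip(design, y))


def lrt(y, reduced, full):
    """Gaussian likelihood ratio statistic for two nested linear models."""
    return len(y) * math.log(rss(reduced, y) / rss(full, y))


def chi2_sf_1df(stat):
    """P-value from the chi-squared approximation with 1 degree of freedom."""
    return math.erfc(math.sqrt(max(stat, 0.0) / 2.0))


# Simulated data: a modest marker effect plus a strong external transcript.
rng = random.Random(1)
n = 100
marker = [rng.randint(0, 1) for _ in range(n)]
x2 = [rng.gauss(0, 1) for _ in range(n)]
y = [0.5 * m + 2.0 * x + rng.gauss(0, 0.2) for m, x in zip(marker, x2)]

mean_only = [[1.0] for _ in range(n)]
model1 = [[1.0, m] for m in marker]                       # mean + marker
covariate_only = [[1.0, x] for x in x2]                   # mean + transcript
model2 = [[1.0, m, x] for m, x in zip(marker, x2)]        # mean + marker + transcript

lrt_marker_model1 = lrt(y, mean_only, model1)       # marker test, model (1)
lrt_marker_model2 = lrt(y, covariate_only, model2)  # marker test, model (2)
lrt_transcript = lrt(y, model1, model2)             # model (2) vs. model (1)
print(chi2_sf_1df(lrt_marker_model1), chi2_sf_1df(lrt_marker_model2))
```

In this simulated setting the transcript absorbs most of the residual variance, so the marker statistic is much larger under model (2) than under model (1).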
Support vector machines techniques
SVM techniques are a well-known tool for classification and prediction [20]. The rationale for using SVM in this context was to have an alternative to maximum likelihood to identify the variables that best predict the trait (expression level) of interest. As in the previous section, suppose we have an n-dimensional real vector containing the expression level to be studied (y), and a collection of d-dimensional real vectors x_{i} that contain the descriptive variables (i.e., all markers and the rest of cDNAs). The goal of SVM is to produce a predictive function for the expression levels of each individual. Therefore, the input of an SVM can be collected in a set of pairs S = {(x_{1}', y_{1}),...,(x_{n}', y_{n})}, while the output is a vector w* and a scalar b* such that the function h defined by
h(x) = w*' x + b* (3)
is a prediction of the expression level y of an individual described by x. An important issue is to fix the criterion for measuring the quality of the prediction. In our case, the aim is to produce a function h (Eq. 3) such that the relative ordering of (h(x_{1}),...,h(x_{n})) is as close as possible to the observed ordering of (y_{1},...,y_{n}). For this purpose we used the loss function [21] that returns the proportion of pairs (i, j) whose predicted relative ordering (h(x_{i}), h(x_{j})) is swapped with respect to the observed ordering (y_{i}, y_{j}). Formally, the loss of h in the set S is defined as the probability
$\Delta_{SP}(h, S) = P\left(h(x_{i}) \le h(x_{j}) \mid y_{i} > y_{j}\right) = \frac{\sum_{i,j:\, y_{i} > y_{j}} I\{h(x_{i}) \le h(x_{j})\}}{\sum_{i,j} I\{y_{i} > y_{j}\}} \qquad (4)$
where I{p(x)} is the function that returns 1 when the predicate p(x) is true, 0 otherwise. This loss function can be seen as a generalization of the complement of the Area Under a Receiver Operating Characteristic (ROC) curve, AUC for short. Hanley and McNeil [22] showed that the AUC is the probability of a correct ranking, and thus the AUC coincides with the value of the Wilcoxon-Mann-Whitney nonparametric statistic. Here, we report
AUC = 100(1 − Δ_{SP}(h, T)) (5)
as a measure of the goodness of the SVM prediction models. Technically, the SVM seeks w* and b* as the solution to the following convex quadratic optimization problem,
$\min \frac{1}{2} w'w + C \sum_{i=1}^{n} \left(\xi_{i}^{-} + \xi_{i}^{+}\right) \qquad (6)$
subject to
$(w'x_{i} + b) - y_{i} \le \epsilon + \xi_{i}^{+}$ and $y_{i} - (w'x_{i} + b) \le \epsilon + \xi_{i}^{-}$,
where C is the regularization parameter and ξ are slack variables (ξ^{+}, ξ^{-} ≥ 0). Notice that the regression approach is similar to fitting the rank, although the function optimised by regression is not exactly a measure of the agreement between observed and predicted rankings. In fact, we could have chosen an SVM solution whose goal was to optimise the loss function (4) directly; see, e.g., Joachims [23, 24]. However, the results achieved with regression were good enough and are faster to obtain.
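The pairwise loss of Eq. (4) and the AUC of Eq. (5) are simple to compute once predictions are available. The following Python sketch is an illustration only; the function names are hypothetical, and the quadratic loop over pairs is chosen for clarity rather than speed.

```python
def swapped_pairs_loss(pred, obs):
    """Eq. (4): among pairs with obs[i] > obs[j], the fraction for which
    the prediction does not preserve the order, i.e. pred[i] <= pred[j]."""
    swapped = 0
    total = 0
    for i in range(len(obs)):
        for j in range(len(obs)):
            if obs[i] > obs[j]:
                total += 1
                if pred[i] <= pred[j]:
                    swapped += 1
    if total == 0:
        raise ValueError("all observed values are tied; the loss is undefined")
    return swapped / total


def auc_percent(pred, obs):
    """Eq. (5): AUC expressed as a percentage."""
    return 100.0 * (1.0 - swapped_pairs_loss(pred, obs))


# One of the three ordered pairs is swapped, so the AUC is two thirds of 100.
print(auc_percent([1.0, 3.0, 2.0], [10.0, 20.0, 30.0]))
```

Because only the ordering matters, any monotone increasing transformation of the predictions leaves this measure unchanged.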
We used the SVM^{light} software [9] to produce regressors that were evaluated with a 10-fold cross-validation using the AUC loss function. This means that the whole dataset was randomly split into 10 partitions, each resulting in a training subset and a test subset. Equation (6) is used by the SVM to produce a function h (Eq. 3) from the training subset, whereas h is evaluated (Eqs. 4 and 5) on the test subset. The performance estimate returned by the cross-validation method is the mean over all 10 partitions. The kernel used was linear, C was set to 1 (usually the default value in most SVM environments), and the parameter ε was set to 0.01, the default value in the implementation used.
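The cross-validation scheme described above can be sketched as follows. This is a generic illustration and not the SVM^{light} workflow itself: `fit` and `score` are hypothetical user-supplied callables (for instance, a routine that trains a regressor and a ranking measure such as the AUC), and the seed and function names are assumptions.

```python
import random


def k_fold_splits(n, k=10, seed=0):
    """Yield (train, test) index lists for a random k-fold partition of range(n)."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    folds = [order[f::k] for f in range(k)]
    for f, test in enumerate(folds):
        train = [i for g, fold in enumerate(folds) if g != f for i in fold]
        yield train, test


def cross_validate(fit, score, xs, ys, k=10, seed=0):
    """Mean test-fold score.

    fit(train_xs, train_ys) must return a predictor h, and
    score(predictions, observations) must return a number.
    """
    scores = []
    for train, test in k_fold_splits(len(ys), k, seed):
        h = fit([xs[i] for i in train], [ys[i] for i in train])
        pred = [h(xs[i]) for i in test]
        scores.append(score(pred, [ys[i] for i in test]))
    return sum(scores) / len(scores)


# With 100 observations, each of the 10 partitions trains on 90 and tests on 10.
splits = list(k_fold_splits(100))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

Each observation is used for testing exactly once, so the reported mean is never computed on data that were used to fit the corresponding regressor.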
The markers were treated as discrete variables, with each of the three values (the three genotypes) transformed into three Boolean attributes. Thus, when a marker was found to be among the 50 most relevant for a given trait, the three associated Boolean variables were included in the corresponding model. The algorithm used to discover relevancies was the so-called Recursive Feature Elimination (RFE) [10], a simple yet efficient method when the kernel is linear. We ran SVM for several cDNA levels considering as predictors either all variables (cDNAs and markers), only markers, or only cDNA levels. We set a maximum of 50 variables to be included in the decision rule h.
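The encoding of markers and the elimination loop can be illustrated with a short Python sketch. It rests on stated assumptions and is not the implementation used here: a ridge regression stands in for the linear SVM so that the example needs no external solver, features are removed one column at a time by smallest squared weight, the genotype labels `BB`, `BD` and `DD` are hypothetical, and columns are assumed to be on comparable scales.

```python
import random


def one_hot(genotypes, levels=("BB", "BD", "DD")):
    """Turn one marker's genotypes into one Boolean (0/1) column per genotype."""
    return [[1.0 if g == lev else 0.0 for lev in levels] for g in genotypes]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def ridge_weights(rows, y, lam):
    """Weights of a ridge regression on centred data (stand-in for a linear SVM)."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    xc = [[r[j] - means[j] for j in range(p)] for r in rows]
    y_mean = sum(y) / n
    yc = [v - y_mean for v in y]
    xtx = [[sum(r[a] * r[b] for r in xc) + (lam if a == b else 0.0)
            for b in range(p)] for a in range(p)]
    xty = [sum(r[a] * v for r, v in zip(xc, yc)) for a in range(p)]
    return solve(xtx, xty)


def rfe(rows, y, keep, lam=1.0):
    """Recursive feature elimination: refit, drop the column with the
    smallest squared weight, and repeat until `keep` columns remain."""
    active = list(range(len(rows[0])))
    while len(active) > keep:
        sub = [[r[j] for j in active] for r in rows]
        w = ridge_weights(sub, y, lam)
        weakest = min(range(len(active)), key=lambda a: w[a] ** 2)
        del active[weakest]
    return active


# Simulated example: columns 0-2 encode one marker, columns 3-6 are transcripts;
# the trait depends on columns 3 and 6 only.
rng = random.Random(3)
n = 60
marker_cols = one_hot([rng.choice(["BB", "DD"]) for _ in range(n)])
transcripts = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(n)]
rows = [m + t for m, t in zip(marker_cols, transcripts)]
y = [3.0 * r[3] - 2.0 * r[6] + rng.gauss(0, 0.1) for r in rows]
kept = rfe(rows, y, keep=2)
print(kept)
```

In the analysis described above, the three Boolean columns of a marker enter or leave the model together; the sketch ranks columns individually for brevity.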
Abbreviations
AUC: area under a receiver operating characteristic curve
QTL (eQTL): (expression) quantitative trait locus
RFE: recursive feature elimination
SVM: support vector machine
Declarations
Acknowledgements
We thank the authors Chesler and colleagues, especially Rob W. Williams, for making their data publicly available. We are also grateful to the referees for their suggestions. Work funded by grants AGL20040103, GEN200320658 and TIN200508288 (Ministry of Education, Spain).
References
1. Schadt EE, Monks SA, Drake TA, Lusis AJ, Che N, Colinayo V, Ruff TG, Milligan SB, Lamb JR, Cavet G, Linsley PS, Mao M, Stoughton RB, Friend SH: Genetics of gene expression surveyed in maize, mouse and man. Nature. 2003, 422 (6929): 297-302. 10.1038/nature01434.
2. Schadt EE, Lamb J, Yang X, Zhu J, Edwards S, Guhathakurta D, Sieberts SK, Monks S, Reitman M, Zhang C, Lum PY, Leonardson A, Thieringer R, Metzger JM, Yang L, Castle J, Zhu H, Kash SF, Drake TA, Sachs A, Lusis AJ: An integrative genomics approach to infer causal associations between gene expression and disease. Nat Genet. 2005, 37 (7): 710-717. 10.1038/ng1589.
3. Williams RB, Cotsapas CJ, Cowley MJ, Chan E, Nott DJ, Little PF: Normalization procedures and detection of linkage signal in genetical-genomics experiments. Nat Genet. 2006, 38 (8): 855-856. 10.1038/ng0806-855.
4. Storey JD, Akey JM, Kruglyak L: Multiple locus linkage analysis of genomewide expression in yeast. PLoS Biol. 2005, 3 (8): e267. 10.1371/journal.pbio.0030267.
5. Vapnik V: Statistical learning theory. 1988, John Wiley.
6. Chu F, Wang L: Applications of support vector machines to cancer classification with microarray data. Int J Neural Syst. 2005, 15 (6): 475-484. 10.1142/S0129065705000396.
7. Furey TS, Cristianini N, Duffy N, Bednarski DW, Schummer M, Haussler D: Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics. 2000, 16 (10): 906-914. 10.1093/bioinformatics/16.10.906.
8. West M: Large p, small n paradigm. Bayesian statistics 7: Proc 7th Valencia International Meeting. Edited by: Bernardo JM. 2003, Oxford, Clarendon Press, 723-732.
9. Joachims T: Text Categorization with Support Vector Machines: Learning with Many Relevant Features. Proc 10th Eur Conf Machine Learning (ECML). 1998, [http://svmlight.joachims.org/]
10. Guyon I, Weston J, Barnhill S, Vapnik V: Gene selection for cancer classification using support vector machines. Machine Learning. 2002, 46: 359-422. 10.1023/A:1012487302797.
11. Bahamonde A, Bayón GF, Díez J, Quevedo JR, Luaces O, del Coz JJ, Alonso J, Goyache F: Feature subset selection for learning preferences: a case study. Proc of the 21st Int Conf Machine Learning, ICML. 2004, 49-56.
12. Chesler EJ, Lu L, Shou S, Qu Y, Gu J, Wang J, Hsu HC, Mountz JD, Baldwin NE, Langston MA, Threadgill DW, Manly KF, Williams RW: Complex trait analysis of gene expression uncovers polygenic and pleiotropic networks that modulate nervous system function. Nat Genet. 2005, 37 (3): 233-242. 10.1038/ng1518.
13. Brem RB, Yvert G, Clinton R, Kruglyak L: Genetic Dissection of Transcriptional Regulation in Budding Yeast. Science. 2002, 296 (5568): 752-755. 10.1126/science.1069516.
14. Darvasi A: Genomics: Gene expression meets genetics. Nature. 2003, 422 (6929): 269-270. 10.1038/422269a.
15. de Koning DJ, Haley CS: Genetical genomics in humans and model organisms. Trends in Genetics. 2005, 21 (7): 377-381. 10.1016/j.tig.2005.05.004.
16. Pérez-Enciso M: In silico study of transcriptome genetic variation in outbred populations. Genetics. 2004, 166 (1): 547-554. 10.1534/genetics.166.1.547.
17. GeneNetwork: www.genenetwork.org.
18. Carlborg O, De Koning DJ, Manly KF, Chesler E, Williams RW, Haley CS: Methodological aspects of the genetic dissection of gene expression. Bioinformatics. 2005, 21 (10): 2383-2393. 10.1093/bioinformatics/bti241.
19. Pérez-Enciso M, Misztal I: Qxpak: a versatile mixed model application for genetical genomics and QTL analyses. Bioinformatics. 2004, 20 (16): 2792-2798. 10.1093/bioinformatics/bth331.
20. Hastie T, Tibshirani R, Friedman JH: The elements of statistical learning. 2001, New York, Springer Verlag.
21. Herbrich R, Graepel T, Obermayer K: Large margin rank boundaries for ordinal regression. Advances in Large Margin Classifiers. Edited by: Smola AJ, Bartlett PL, Schölkopf B, Schuurmans D. 2000, Cambridge, MIT Press, 115-132.
22. Hanley JA, McNeil BJ: The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology. 1982, 143: 29-36.
23. Joachims T: A support vector method for multivariate performance measures. Proc 22nd Int Conf Machine Learning (ICML). 2005.
24. Joachims T: Training Linear SVMs in Linear Time. Proc 20th ACM Conf Knowledge Discovery and Data Mining (KDD). 2006.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.