 Research
 Open Access
Context-based preprocessing of molecular docking data
BMC Genomics volume 14, Article number: S6 (2013)
Abstract
Background
Data preprocessing is a major step in data mining. In data preprocessing, several known techniques can be applied, or new ones developed, to improve data quality such that the mining results become more accurate and intelligible. Bioinformatics is one area with a high demand for generation of comprehensive models from large datasets. In this article, we propose a context-based data preprocessing approach to mine data from molecular docking simulation results. The test cases used a fully-flexible receptor (FFR) model of Mycobacterium tuberculosis InhA enzyme (FFR_InhA) and four different ligands.
Results
We generated an initial set of attributes as well as their respective instances. To improve this initial set, we applied two selection strategies. The first was based on our context-based approach while the second used the CFS (Correlation-based Feature Selection) machine learning algorithm. Additionally, we produced an extra dataset containing features selected by combining our context strategy and the CFS algorithm. To demonstrate the effectiveness of the proposed method, we evaluated its performance based on various predictive (RMSE, MAE, Correlation, and Nodes) and context (Precision, Recall and F-score) measures.
Conclusions
Statistical analysis of the results shows that the proposed context-based data preprocessing approach significantly improves predictive and context measures and outperforms the CFS algorithm. Context-based data preprocessing improves mining results by producing more interpretable models, which makes it well-suited for practical applications in molecular docking simulations using FFR models.
Background
Data preprocessing is a major step in data mining. Although time-consuming, it improves data quality so the data can be properly mined, thus producing more accurate, interpretable, and applicable models. Many techniques can be applied to data preprocessing [1], including data cleaning, data integration, and data transformation. In predictive machine learning problems, there is an input x and an output y; the task is to learn how to map the input to the output. Such a mapping can be defined as a function y = g(x|θ), where g(.) is the model and θ its parameters [2].
Although we can find numerous algorithms for prediction, many of them only produce a predictive function that indicates which target value the objects belong to. In some data mining problems, however, a better comprehension of the induced models is necessary. Decision trees are models well understood by users. Indeed, Freitas et al. [3] support the use of decision tree models, instead of black-box algorithms such as Support Vector Machines (SVM) or Neural Networks, to graphically represent the patterns revealed by data mining. Still according to these authors [3], the hierarchical structure of a decision tree can emphasize the importance of the attributes used for prediction.
The incorporation of context-aware data preprocessing to improve mining results is an active area of research. Baralis et al. [4] developed CASMine, a context-based framework to extract generalized association rules, providing a high-level abstraction of both user habits and service characteristics, depending on the context. Nam and Rhee [5] discuss how context can help classify face images. Although these authors discuss the importance of considering context in data mining applications and develop their work according to a context-aware definition, the context involved is intrinsically specific to each working background. Hence, their methodologies are not suitable for the molecular docking simulation context explored in this work.
There are many areas of application where a comprehensible model is fundamental to its proper use. In bioinformatics, a set of data and a set of data mining models alone may not be enough: the data and the results must represent the context in which they are embedded. Bioinformatics is a clear example of where we believe data preprocessing is instrumental. Our contribution is within the context of rational drug design (RDD). The interactions between biological macromolecules, called receptors, and small molecules, called ligands, constitute the fundamental principle of RDD. In silico molecular docking simulations, an important phase of RDD, investigate the best binding pose and conformation of a ligand in a receptor. The best ligands are tested by in vitro and/or in vivo experiments. If the results are promising, a new drug candidate can be produced [6].
Proper data preprocessing may induce decision tree models able to identify important features of the receptor-ligand interactions from molecular docking simulations. In the present work, we propose and apply a predictive regression decision tree to context-based preprocessed data from docking results, and show that bioinformaticians can easily understand, explore, and apply the induced models. We apply four preprocessing techniques. First, we produce and arrange all attributes based on the domain knowledge. Second, still based on the context domain, we improve the input by selecting appropriate features. Third, we apply a conventional machine learning feature selection to the initial set of attributes. Finally, we combine the features selected by the first and second strategies with those from the third strategy. We assess the results for the models' accuracy and interpretability, and demonstrate how a careful, value-added data preprocessing can produce more effective models.
Methods
The molecular docking context
Interaction between drug candidates (ligands) and target proteins (receptors), through molecular docking simulations, is the computational basis of RDD. Given a receptor, molecular docking simulations sample a large number of orientations and conformations of a ligand inside its binding site. The simulations also evaluate the Free Energy of Binding (FEB) and rank the orientations/conformations according to their FEB scores [7]. Most molecular docking algorithms consider only the ligand as flexible, while the receptor remains rigid, due to the computational cost of modelling the receptor's explicit flexibility. However, biological macromolecules, like protein receptors, are intrinsically flexible in their cellular environment. The receptor may modify its shape upon ligand binding, moulding itself to be complementary to the ligand [8]. This increases favourable contacts and reduces adverse interactions, which, in turn, minimizes the total FEB [9]. Therefore, it is important to consider the receptor's explicit flexibility in molecular docking simulations.
In this work, we model the receptor's full explicit flexibility in the molecular docking simulations [10] using a set of different conformations of the receptor generated by molecular dynamics (MD) simulations [11]. This type of representation, named a fully-flexible receptor (FFR) model [10], results in the need to execute large numbers of docking simulations and to analyse voluminous results. Indeed, one of the current major challenges in bioinformatics is how to handle large amounts of data [12], or big data [13].
Data modelling and acquisition
The InhA enzyme from Mycobacterium tuberculosis (Mtb) [14] is the target protein receptor in this work. It contains 268 amino acid residues and 4,008 atoms. The 3D structure (PDB ID: 1ENY) of the crystal, rigid receptor [14], was retrieved from the Protein Data Bank [15]. The FFR model of InhA (FFR_InhA) contains 3,100 snapshots from a 3.1 ns MD simulation [11]. Machado et al. [10] performed molecular docking simulations of FFR_InhA against each of the four different ligands: TCL [16], PIF [17], ETH [18] and NADH [14].
All docking results and snapshots of the FFR_InhA model were stored in a proper repository [19]. We developed this repository to integrate FFR models and docking results, allowing users to query the database from different points of view [20]. In fact, queries can traverse relationships between receptor and ligand atoms and vice versa, including their conformations and 3D coordinates. This repository enables us to produce effective inputs for different data mining tasks and their corresponding algorithms.
Attributes arrangements
A major objective of this work is to reduce the number of snapshots used as input in docking simulations of an FFR model against a given ligand. In this sense, by mining the data from the FFR model and its docking results, we expect to select the subset of all available receptor conformations that is most relevant and capable of indicating whether a given ligand is a promising compound. Machado et al. [21, 22] demonstrated how data mining can address this question. Winck et al. [23] obtained encouraging results by applying context-based preprocessing to data mining of biological text. Hence, we focus our efforts on context-based data preprocessing. Our database [19] offers many available features, and choosing the most important ones directly impacts the choice of the proper data mining algorithm. A predictive data mining task is defined by its target attribute [24]. In the following sections we define the target and predictive attributes based on the domain-specific knowledge of this work.
Target attribute definition
One way to evaluate a molecular docking simulation with AutoDock3.0.5 [25] is by examining the values of the resulting free energy of binding (FEB): the most negative FEB values generally indicate the best receptor-ligand binding affinity. AutoDock3.0.5 predicts the bound conformations of a ligand to a receptor. It combines a conformational search algorithm with a rapid grid-based method of energy evaluation [25]. The AutoGrid module of AutoDock3.0.5 pre-calculates a 3D energy-based grid of interactions for various atom types. Figure 1 shows an example of the grid box used in this work.
We adopt the FEB as our target attribute because it discriminates docking results. There is no consensus about what is the reasonable range of FEB values. Each ligand has to be considered and evaluated individually. Analysis of FEB values from the docking simulations of the FFR_InhA with the four ligands produced different ranges of minimum, maximum and average FEB values (Table 1).
Analysis of Table 1 shows that the differences between the lowest and highest values are subtle. Although there is an absolute difference between these extreme values (for instance, 2.95 kcal/mol for ETH), in many instances the variation lies in the decimal values only; for ETH, for example, the difference between two FEB values such as −6.71 and −6.03 kcal/mol can be significant. In previous work, Machado et al. [26, 27], using the same four ligands, discretized the FEB values by three different procedures: equal frequency, equal width, and an original method based on the mode and standard deviation of the FEB values. The authors split the FEB into five classes: Excellent, Good, Regular, Bad, and Very Bad. This preprocessing step generated the input data upon which the J48 decision tree algorithm was executed. The resulting performance measures showed that discretization by equal frequency is not satisfactory, and that discretization by equal width had good evaluation for only two of the four ligands [27]; in these cases, J48 did not generate legible trees. Discretization by the mode and standard deviation, however, had better performance measures for two ligands and produced more legible decision trees for all four ligands [27]. Although the J48 algorithm produced encouraging results, we found it challenging to discretize FEB values whose differences were particularly small. For instance, it was difficult to decide whether a FEB value of −8.10 kcal/mol is a Good or a Regular FEB when the difference to the next FEB value was only 0.10 kcal/mol. Because of the significance of the decimal values, we may lose important information when applying such a discretization to FEB values. Therefore, FEB values are taken as real numbers, which implies the use of a regression predictive data mining task.
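The equal-width discretization discussed above can be sketched as follows; this is a minimal illustration, with hypothetical FEB values rather than the authors' data (recall that more negative FEB means better binding, so the lowest bin maps to "Excellent"):

```python
def discretize_equal_width(values, labels):
    """Split the value range into len(labels) equal-width bins and
    label each value with its bin's class."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / len(labels)

    def bin_of(v):
        # Clamp the top edge into the last bin.
        i = min(int((v - lo) / width), len(labels) - 1)
        return labels[i]

    return [bin_of(v) for v in values]

# Classes used by Machado et al. [27]; FEB values below are hypothetical.
labels = ["Excellent", "Good", "Regular", "Bad", "Very Bad"]
febs = [-9.2, -8.1, -8.0, -7.4, -6.7, -6.0, -5.1, -4.3]  # kcal/mol
classes = discretize_equal_width(febs, labels)
```

The sketch makes the paper's point concrete: two FEB values 0.10 kcal/mol apart can fall on opposite sides of a bin boundary, so keeping FEB as a real-valued target avoids this loss of information.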
Predictive attributes definition
According to Jeffrey [28] and da Silveira et al. [29], a meaningful contact between two atoms can be established at a distance as large as 4.0 Å. In molecular docking, the FEB value depends on the shortest distances between atoms of the receptor's residues and the ligand, because receptor-ligand atom pairs within 4.0 Å engage in favourable hydrogen bonds (HB) and hydrophobic contacts (HP) [28]. Hence, for each receptor (R) residue, we calculate the Euclidean distance (ED) between its atoms and the atoms of the ligand (L). We define min(Dist_{ R,L }) as the predictive attribute representing the shortest distance between the ligand and a receptor residue. Thus, min(Dist_{ R,L }) with a 4.0 Å threshold indicates the presence of favourable receptor-ligand contacts (HBs and HPs). Only min(Dist_{ R,L }) is recovered from the repository [19]. If we used all receptor-ligand distances, the input file would have an enormous number of attributes; for example, for the PIF ligand, which has 24 atoms, the entry would have more than 96,000 attributes. Such a number of predictive attributes would generate model trees with huge numbers of nodes and, therefore, would not be interpretable. Instead, each of the 3,100 snapshots of the FFR_InhA has 268 attributes. We repeat the same procedure for all four ligands, producing, in the end, one preprocessed input for each of the four ligands.
Data preprocessing strategies
Our database stores the FFR_InhA model, which contains 3,100 snapshots (Sn), each with 4,008 atoms (AtR), totalling Sn × AtR = 12,424,800 receptor coordinates (CoordR). Because each docking simulation comprises 10 runs, we obtain 31,000 docking results for each ligand. However, some docking simulation runs did not converge or had positive FEB values. This occurs when the number of runs and cycles defined as parameters for the algorithm are not enough to find a good position to bind the ligand into the receptor. The docking simulations were performed with the Simulated Annealing (SA) algorithm, which explores conformations using a Monte Carlo approach. Since in each step of the execution a random movement is applied inside the binding site, the ligand sometimes remains in a non-favourable position during the number of runs established. If this happens during many runs, the docking result does not converge, that is, it does not present any interaction position/energy at the end of the execution of a given experiment. We considered these data as outliers and did not include them in the preprocessing step. We also defined the parameter ValDoc as the total number of valid docking simulations per ligand. Since AtLig is the number of atoms of each ligand, the sum of the products AtLig × ValDoc over all four ligands determines the total number of ligand coordinates (LigCoord). In summary, we have:

CoordR = 3,100 × 4,008 = 12,424,800 records
LigCoord_{ NADH } = 52 × 11,284 = 586,768 records
LigCoord_{ TCL } = 18 × 28,370 = 510,660 records
LigCoord_{ PIF } = 24 × 30,420 = 730,080 records
LigCoord_{ ETH } = 13 × 30,430 = 395,590 records
LigCoord = 586,768 + 510,660 + 730,080 + 395,590 = 2,223,098 records
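The record counts above follow directly from the snapshot, atom, and valid-docking numbers; a quick arithmetic check:

```python
# Receptor coordinates: snapshots × atoms per snapshot.
coord_r = 3_100 * 4_008
assert coord_r == 12_424_800

# Ligand coordinates: atoms per ligand × valid docking results (ValDoc).
ligands = {"NADH": (52, 11_284), "TCL": (18, 28_370),
           "PIF": (24, 30_420), "ETH": (13, 30_430)}
lig_coord = {name: atoms * val_doc
             for name, (atoms, val_doc) in ligands.items()}
total = sum(lig_coord.values())
assert total == 2_223_098
```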
Data generation
To generate an initial dataset we need to combine the 12,424,800 CoordR and the 2,223,098 LigCoord, calculate their interactions, and find their respective min(Dist_{ R,L }). For that, we developed the Dataset Algorithm. It executes the first preprocessing step by handling the input data and producing the best receptor-ligand interactions, stored in an output file: the [Input] matrix. [Input] contains ValDoc lines and 269 columns. The first 268 columns contain the min(Dist_{ R,L }) values for the 268 receptor residues. To generate a proper dataset for data mining, we appended the target attribute, the corresponding FEB value, in the last column. It is important to emphasize that, at this stage, min(Dist_{ R,L }) can have any positive value.
 Dataset Algorithm 
Let R be a receptor
Let L be a ligand
Let t be a snapshot of R
Let r be a residue of R
Let a be an atom in t snapshot
Let l be an atom in L
Let Dist be the distance between L and R atoms in t
Let DistanceMatrix be a matrix where each line corresponds to a residue r and each cell corresponds to the distance between a and l
Let Result be a matrix that stores for each t snapshot, all minimum distances between a and l
Let Input be a matrix containing Result and, for each t , its respective FEB value
FOR each t in Total_Snapshots _{ R }
[ Result ]_{ * } ← null
FOR each r in Total_Residues_{ R }
[ DistanceMatrix_{ *,* }] ← null
FOR each a in Total_Atoms_Residue_Snapshot_{ R,t }
FOR each l in Total_Atoms_Ligand_{ L }
Dist_{ Ra,Ll } ← ED(R, L) = \sqrt{{\left({x}_{R}-{x}_{L}\right)}^{2}+{\left({y}_{R}-{y}_{L}\right)}^{2}+{\left({z}_{R}-{z}_{L}\right)}^{2}}
[ DistanceMatrix_{ a,l }] ← Dist_{ Ra,Ll }
ENDFOR
ENDFOR
[ Result_{ t,r }] ← min ([ DistanceMatrix_{ r,* }])
ENDFOR
[ Input_{ t,* }] ← [ Result_{ t,* } + FEB_{ L }]
ENDFOR
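A minimal Python sketch of the Dataset Algorithm above, assuming atom coordinates are already in memory as plain coordinate tuples (the repository access layer [19] is not shown; function names are ours, not part of the original implementation):

```python
import math

def min_residue_ligand_distances(residues, ligand_atoms):
    """For one snapshot: per residue, the shortest Euclidean distance
    between any of its atoms and any ligand atom, i.e. min(Dist_{R,L})."""
    return [
        min(math.dist(a, l) for a in residue_atoms for l in ligand_atoms)
        for residue_atoms in residues
    ]

def build_input(snapshots, ligand_atoms_per_snapshot, feb_per_snapshot):
    """[Input] matrix: one line per valid docking result, the residue
    min-distance columns plus the FEB target value in the last column."""
    return [
        min_residue_ligand_distances(residues, lig) + [feb]
        for residues, lig, feb in zip(snapshots,
                                      ligand_atoms_per_snapshot,
                                      feb_per_snapshot)
    ]

# Toy example: one snapshot with two residues, one single-atom ligand.
snapshots = [[[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],   # residue 1 (2 atoms)
              [(5.0, 0.0, 0.0)]]]                    # residue 2 (1 atom)
ligand = [[(2.0, 0.0, 0.0)]]
input_matrix = build_input(snapshots, ligand, [-7.5])
```

In the real setting the inner loop runs over 268 residues and up to 52 ligand atoms per snapshot, exactly as the pseudocode iterates over r, a and l.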
Dataset improvement
The initial dataset generated by the Dataset Algorithm contains 268 predictive attributes and a target attribute. To help improve the models induced by the data mining task, we must further reduce the number of features. Jeffrey [28] states that the largest distance that still allows a meaningful contact between receptor and ligand atoms is 4.0 Å. The Dataset Algorithm, however, keeps attributes whose distances are always higher than 4.0 Å, meaning the corresponding receptor residue does not establish a favourable contact with any of the ligand atoms [29]. If a residue makes no contact in any docking result, it is unlikely to help predict the FEB value. Therefore, we removed all attributes (residues) whose shortest distances stay above the 4.0 Å threshold. The ContextFS Algorithm generates this new input from the [Input] matrix produced by the Dataset Algorithm. To compare our context-based feature selection with a well-known machine learning feature selection algorithm, we generated one more dataset, again seeking to improve the initial one produced by the Dataset Algorithm, on the belief that a subset of representative attributes can further improve the mining results.
 ContextFS Algorithm 
Let R be a receptor
Let t be a snapshot of R
Let r be a residue of R
Let Input be the matrix produced by the Dataset Algorithm
Let InputFS be the resulting matrix after our contextbased feature selection
FOR each r in Total_Residues _{ R }
IF min ([ Input_{ *,r }]) ≤ 4
FOR each t in Input
[ InputFS_{ t,r }] ← [ Input_{ t,r }]
ENDFOR
ENDIF
ENDFOR
FOR each t in InputFS
[ InputFS_{ t,FEB } ] ← [ Input_{ t,269 } ]
ENDFOR
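The ContextFS Algorithm amounts to keeping only the residue columns whose minimum over all instances is within the 4.0 Å contact threshold, then re-attaching the FEB column; a sketch (function name is ours):

```python
CONTACT_THRESHOLD = 4.0  # Å, the meaningful-contact limit after Jeffrey [28]

def context_fs(input_matrix):
    """Keep residue columns whose column-wise minimum distance is
    <= 4.0 Å in at least one docking result; the last column (FEB
    target) is always kept. Returns (kept column indices, new matrix)."""
    n_res = len(input_matrix[0]) - 1  # last column is the FEB target
    keep = [r for r in range(n_res)
            if min(row[r] for row in input_matrix) <= CONTACT_THRESHOLD]
    reduced = [[row[r] for r in keep] + [row[-1]] for row in input_matrix]
    return keep, reduced

# Toy [Input]: two instances, two residue columns, FEB in the last column.
keep, reduced = context_fs([[3.5, 9.0, -7.0],
                            [6.0, 8.0, -6.5]])
```

Here the first residue reaches 3.5 Å in one docking result and is kept, while the second never comes closer than 8.0 Å and is dropped.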
Only a limited number of the existing feature selection algorithms can work effectively on regression predictive tasks. Among these, the Correlation-based Feature Selection (CFS) [30] algorithm, implemented in Weka [24], can perform feature selection on our datasets. Therefore, we applied CFS to each input generated by the Dataset Algorithm, with a different input for each of the four ligands. CFS follows a filter approach that ranks features according to a correlation-based heuristic evaluation function (Equation 1). It looks for a subset that contains features uncorrelated with each other, but highly correlated with the target attribute.
{M}_{S}=\frac{k\phantom{.}{\overline{r}}_{cf}}{\sqrt{k+k\left(k-1\right){\overline{r}}_{ff}}} (1)
where M_{ S } is the heuristic merit of a subset S that contains k features, \overline{r}_{ cf } is the mean feature-target correlation (f ∈ S), and \overline{r}_{ ff } is the average feature-feature intercorrelation. Equation 1 forms the core of CFS [30]. Table 2 shows the number of attributes selected after applying our feature selection methodology to the original dataset. Additionally, we generated one more dataset (Table 2, fourth column) which combines the features selected by the ContextFS Algorithm with those selected by CFS [30].
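Equation 1, the CFS merit, can be computed directly from the two average correlations; a sketch of the heuristic only (Weka's CFS additionally searches the space of feature subsets, which is not shown):

```python
import math

def cfs_merit(k, mean_feature_target_corr, mean_feature_feature_corr):
    """CFS merit M_S = k * r_cf / sqrt(k + k*(k-1)*r_ff) [30]."""
    return (k * mean_feature_target_corr
            / math.sqrt(k + k * (k - 1) * mean_feature_feature_corr))
```

For example, four features each correlated 0.5 with the target and uncorrelated with each other score 4 × 0.5 / √4 = 1.0; raising the feature-feature intercorrelation lowers the merit, which is how CFS penalizes redundant features.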
Mining and evaluating the preprocessed data
Regression is a data mining task suited to problems in which the attribute to be predicted is continuous. Since our target attribute is numeric, regression is the technique applied to the mining experiments in this study. Our models must be understandable and must also represent well the context in which they are embedded. Decision trees cover these needs and can be applied to both classification and regression problems, producing regression or classification models arranged in a tree structure. For continuous values, there are two main types of trees: regression trees and model trees. In regression trees, each leaf node stores a continuous-valued prediction, which is the average of the target attribute for the training tuples. In model trees, each leaf stores a regression model called a Linear Model (LM), a multivariate linear equation for the target attribute [1]. Our goal is to induce models that use residue distances to predict a given FEB value. We expect our model to help us discover whether a snapshot, when docked to a given ligand, will lead to favourable estimated FEB values. For this, we use the M5P [31] machine learning model tree algorithm.
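To make the tree-splitting idea concrete, here is a didactic one-level regression stump, not M5P itself: it picks the threshold on a single attribute that minimizes squared error and predicts the mean of each side, as a regression-tree leaf does (a model tree such as M5P would instead fit a linear model in each leaf):

```python
def best_split(xs, ys):
    """One-level regression stump over a single attribute.
    Returns (threshold, left_mean, right_mean) minimizing the
    sum of squared errors of the two mean predictions."""
    def sse(vals):
        m = sum(vals) / len(vals)
        return sum((v - m) ** 2 for v in vals)

    best = None
    for t in sorted(set(xs))[:-1]:  # candidate thresholds
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        score = sse(left) + sse(right)
        if best is None or score < best[0]:
            best = (score, t, sum(left) / len(left), sum(right) / len(right))
    return best[1:]

# Hypothetical data: a residue distance (Å) vs. FEB (kcal/mol).
threshold, left_mean, right_mean = best_split(
    [1, 2, 3, 10, 11, 12],
    [-9.0, -9.0, -9.0, -5.0, -5.0, -5.0])
```

With these toy values the stump splits at 3 Å: short distances predict a favourable mean FEB of −9.0, long distances −5.0, mirroring how the induced trees relate residue distances to FEB.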
Evaluation of the induced models
There are several measures to verify whether the induced models produce acceptable numerical predictions. They are called predictive measures. For model tree algorithms, the most widespread measures are: root mean-squared error (RMSE, Equation 2), mean absolute error (MAE, Equation 3), and the correlation coefficient [24]. Smaller values of RMSE and MAE indicate better models. All of these measures make use of the predicted values p_{1} . . . p_{ n } and the actual values a_{1} . . . a_{ n }.
The correlation coefficient (Equation 4) measures the statistical correlation between a and p. Its values range from 1, for perfectly correlated results, through 0, when there is no correlation, to −1, for a perfect inverse correlation. We look for perfectly correlated results, or correlation coefficients close to 1.
Where {S}_{PA}=\frac{{\sum}_{i}\left({p}_{i}-\overline{p}\right)\left({a}_{i}-\overline{a}\right)}{n-1}, {S}_{p}=\frac{{\sum}_{i}{\left({p}_{i}-\overline{p}\right)}^{2}}{n-1}, and {S}_{a}=\frac{{\sum}_{i}{\left({a}_{i}-\overline{a}\right)}^{2}}{n-1}, with \overline{a} and \overline{p} being the corresponding averages of a and p.
In addition to these measures, some investigations also use a model interpretability metric: the number of nodes in the model tree. The model tree with the smallest number of nodes is the most interpretable [32].
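The predictive measures above (RMSE, MAE, and the correlation coefficient of Equation 4) can be sketched directly from their definitions; note that the (n − 1) denominators of S_{PA}, S_p and S_a cancel in the correlation ratio:

```python
import math

def rmse(actual, predicted):
    """Root mean-squared error (Equation 2)."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual))
                     / len(actual))

def mae(actual, predicted):
    """Mean absolute error (Equation 3)."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

def correlation(actual, predicted):
    """Correlation coefficient (Equation 4); the (n-1) factors cancel."""
    am = sum(actual) / len(actual)
    pm = sum(predicted) / len(predicted)
    spa = sum((p - pm) * (a - am) for p, a in zip(predicted, actual))
    sp = sum((p - pm) ** 2 for p in predicted)
    sa = sum((a - am) ** 2 for a in actual)
    return spa / math.sqrt(sp * sa)
```

For instance, predictions uniformly shifted by 1 kcal/mol from the actual values give RMSE = MAE = 1 but a correlation of 1.0, which is why the measures are reported together.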
Evaluation based on the context
The measures shown in the previous section were used during the evaluation of the generated models. However, as we are interested in the usefulness of the induced models, we propose a new context-based measure, also analyzing the induced model trees and their contents. Figure 2 shows a model tree generated by applying our context-based preprocessing (ContextFS Algorithm) to NADH. This model contains five non-leaf nodes, each representing a selected amino acid residue, and six LMs. Equation 5 depicts the sixth LM (LM6), composed of a selected number of predictive attributes (receptor residues) weighted by their effect on the target attribute (FEB), plus a constant value.
We evaluate the models taking into account the receptor residues present in both the non-leaf nodes and the LMs, bearing in mind that the docking software calculates the FEB value only for the residues within the grid box around the receptor binding site (Figure 1). Consequently, if we are inducing model trees to predict FEB values, models that consider residues located outside the grid box have no direct significance. Usually, a specialist defines which residues belong to the receptor active site. These residues shape the active site for the complementary ligand binding. For InhA, the specialist selected 52 residues, here denoted ESR. Subsequently, by inspecting each model, we identified which model residues (MR) appear in the tree or the LMs (Figure 2 and Equation 5). We are then able to evaluate MR and compare it with ESR by calculating the Precision (Equation 6), Recall (Equation 7) and F-score (Equation 8) measures [1].
In the context of this analysis:

{Relevant} ∩ {Retrieved} can be defined as ESR ∩ MR

{Relevant} can be defined as ESR

{Retrieved} can be defined as MR
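With ESR as the relevant set and MR as the retrieved set, the context measures of Equations 6-8 reduce to set operations; a sketch with hypothetical residue names (the real ESR has 52 residues):

```python
def context_measures(esr, mr):
    """Precision, Recall and F-score over residue sets (Equations 6-8),
    with ESR as {Relevant} and MR as {Retrieved}."""
    hits = len(esr & mr)
    if not mr or hits == 0:
        return 0.0, 0.0, 0.0
    precision = hits / len(mr)
    recall = hits / len(esr)
    fscore = 2 * precision * recall / (precision + recall)
    return precision, recall, fscore

# Hypothetical: 4 specialist-selected residues, 3 in the model, 2 overlap.
esr = {"ILE21", "SER94", "TYR158", "LYS165"}
mr = {"SER94", "TYR158", "GLY96"}
precision, recall, fscore = context_measures(esr, mr)
```

A model that mentions many residues outside the active site loses precision; one that misses active-site residues loses recall, and the F-score balances the two.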
Results
We evaluated the models by means of the predictive and context measures presented. The measures were applied separately to each of the four distinct ligands: NADH, PIF, TCL, and ETH. For each of them, we observed the four data preprocessing strategies:

1.
The results obtained by the initial dataset, generated by Dataset Algorithm;

2.
The results obtained by the contextbased feature selection, generated by ContextFS Algorithm;

3.
The results obtained by the feature selection generated by CFS [30];

4.
The results obtained by combining the feature selections generated by the ContextFS Algorithm and by CFS (Table 2, fourth column).
The initial dataset was the first and, possibly, the most important context-based data preprocessing step. Without previous knowledge about the context, it would not be possible to generate an input that produces interpretable models, as we expected. Because the initial dataset was constructed considering minimum distances (min(Dist_{ R,L })), our hypothesis is that the context-based data preprocessing we propose, including feature selection, produces better results than a reputable feature selection approach in which the context is not observed. Hence, we expected the results from the third strategy to be no better than the others, and the context-based feature selection (second strategy) to give the best results, since the second strategy considers both the context already employed in the initial dataset and further context knowledge to select appropriate features. To evaluate the statistical significance of the results, we applied the Friedman test [33] with a significance level of α = 0.05.
In Table 3 we evaluate whether strategy 3 is significantly worse than the others. We obtained significance levels of p = 0.040 for MAE and p = 0.054 for RMSE, indicating that strategy 3 is probably worse than strategies 1, 2 and 4, although more effort is needed to confirm this. With respect to the context measures (Table 4), we evaluated whether strategy 2 is significantly better than the other strategies. In this case, we can assert that it is, since p = 0.014. We infer that our feature selection approach improves the initial results. Since we are interested in the quality of the induced models, our context-based measure can be considered the most appropriate.
It is noticeable in Tables 3 and 4 that the results differ for each ligand, despite employing the same preprocessing strategy. This is because different ligands have different sizes and different molecular interaction properties, and bind in different regions of the receptor's binding site. As a result, the target attribute FEB has different ranges of values for the distinct ligands (Table 1), which is why the models are induced for individual ligands. Although the models are not interchangeable, we expect them to be used to select ligands that belong to a similar class (high molecular similarity).
Conclusions
Data preprocessing is a significant step in data mining, in which different techniques are applied to improve data quality so that the mining results are more accurate and better interpretable. Many techniques are available to preprocess data, most of them targeting model quality measures. However, some applications, as in bioinformatics, often demand well-suited models. Hence, when the data mining process is based on the context involved, a context-based preprocessing can improve the quality of the induced models.
In this article, we presented a case of mining data from flexible-receptor molecular docking simulation results. The goal was to identify features that characterize the best fit of ligands into a given receptor. Our experiments considered the InhA receptor from M. tuberculosis and four distinct ligands: NADH, PIF, TCL, and ETH. We showed that an appropriate context-based data preprocessing can provide improved results.
We concentrated on four main preprocessing steps which: 1) consider the context to choose an initial set of attributes and the proper instances for each ligand input file; 2) perform feature selection on the initial dataset, taking into account the characteristics of the docking results of each ligand; 3) perform feature selection, for each ligand, based on the CFS machine learning algorithm; and 4) combine the features selected by our context-based approach (ContextFS Algorithm) with those selected by the CFS algorithm. We hypothesized that mining the preprocessed data would provide better results, with respect to the original dataset, when using the second strategy.
We performed mining experiments using the M5P model tree algorithm implemented in Weka. The values of the RMSE error measure, as well as a context-based metric that considers tree interpretability, suggested that we obtain better results when using our feature selection approach (second strategy). Statistical analysis of the results, with the Friedman test, showed that our context-based approach significantly improves predictive measures, while CFS worsens context measures. We conclude that data preprocessing that considers the context involved can improve the mining results and produce more interpretable models. As future studies, we plan to use the induced models generated with the second strategy to select the most promising subset of snapshots, out of a very large ensemble, for a given ligand.
Authors' information
ATW current address:
LaCA - Laboratório de Computação Aplicada, Departamento de Computação Aplicada, Universidade Federal de Santa Maria (UFSM), Santa Maria, RS, Brasil.
KSM current address:
ComBiL, Grupo de Biologia Computacional, Centro de Ciências Computacionais, Universidade Federal do Rio Grande (FURG), Rio Grande, RS, Brasil.
References
Han J, Kamber M: Data Mining: Concepts and Techniques. 2006, Morgan Kaufmann, 2
Alpaydin E: Introduction to Machine Learning. 2010, The MIT Press, 2
Freitas A, Wieser D, Apweiler R: On the importance of comprehensible classification models for protein function prediction. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2010, 99: 172-182.
Baralis E, Cagliero L, Cerquitelli T, Garza P, Marchetti M: CAS-Mine: providing personalized services in context-aware applications by means of generalized rules. Knowl. Inf. Syst. 2011, 28 (2): 283-310. 10.1007/s10115-010-0359-z.
Nam MY, Rhee PK: Pattern recognition using evolutionary classifier and feature selection. Proceedings of the Third International Conference on Fuzzy Systems and Knowledge Discovery. 2006, FSKD'06, Berlin, Heidelberg: Springer-Verlag, 393-399.
Kuntz ID: Structure-based Strategies for Drug Design and Discovery. Science. 1992, 257: 1078-1082. 10.1126/science.257.5073.1078.
Huang SY, Zou X: Ensemble docking of multiple protein structures: Considering protein structural variations in molecular docking. Proteins. 2006, 66: 399-421. 10.1002/prot.21214.
Koshland DE: Application of a Theory of Enzyme Specificity to Protein Synthesis. Proceedings of the National Academy of Sciences. 1958, 44 (2): 98-104. 10.1073/pnas.44.2.98. [http://www.pnas.org/content/44/2/98.full.abstract]
Lybrand T: Ligand-Protein Docking and Rational Drug Design. Curr. Opin. Struct. Biol. 1995, 5: 224-228. 10.1016/0959-440X(95)80080-8.
Machado KS, Schroeder EK, Ruiz DD, Cohen EML, Norberto de Souza O: FReDoWS: a method to automate molecular docking simulations with explicit receptor flexibility and snapshots selection. BMC Genomics. 2011, 12 (Suppl 4): 1-14.
Schroeder E, Basso L, Santos D, Norberto de Souza O: Molecular Dynamics Simulation Studies of the Wild-Type, I21V, and I16T Mutants of Isoniazid-Resistant Mycobacterium tuberculosis Enoyl Reductase (InhA) in Complex with NADH: Toward the Understanding of NADH-InhA Different Affinities. Biophys. J. 2005, 89: 876-884. 10.1529/biophysj.104.053512.
Luscombe NM, Greenbaum D, Gerstein M: What is Bioinformatics? A Proposed Definition and Overview of the Field. Methods of Information in Medicine. 2001, 4: 346-358.
Praneenararat T, Takagi T, Iwasaki W: Integration of interactive, multiscale network navigation approach with Cytoscape for functional genomics in the big data era. BMC Genomics. 2012, 13 (Suppl 7): S24. 10.1186/1471-2164-13-S7-S24.
Dessen A, Quemard A, Blanchard J, Jacobs W, Sacchettini J: Crystal Structure and Function of the Isoniazid Target of Mycobacterium tuberculosis. Science. 1995, 267: 1638-1641. 10.1126/science.7886450.
Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne P: PDB - Protein Data Bank. Nucl. Acids Res. 2000, 28: 235-242. 10.1093/nar/28.1.235.
Kuo M, Morbidoni H, Alland D, Sneddon S, Gourlie B, Staveski M, Leonard M, Gregory J, Janjigian A, Yee C, Musser J, Kreiswirth B, Iwamoto H, Perozzo R, Jacobs W, Sacchettini J, Fodock D: Targeting Tuberculosis and Malaria through Inhibition of Enoyl Reductase: Compound Activity and Structural Data. J. Biol. Chem. 2003, 278 (23): 20851-20859. 10.1074/jbc.M211968200.
Oliveira JS, Sousa EHS, Basso LA, Palaci M, Dietze R, Santos DS, Moreira I: An Inorganic Iron Complex that Inhibits Wild-type and an Isoniazid-resistant Mutant 2-trans-enoyl-ACP (CoA) Reductase from Mycobacterium tuberculosis. Chem. Commun. 2004, 15: 312-313.
Wang F, Langley R, Gulten G, Dover L, Besra G, Jacobs WJ, Sacchettini J: Mechanism of thioamide drug action against tuberculosis and leprosy. J. Exp. Med. 2007, 204: 73-78. 10.1084/jem.20062100.
Winck A, Machado K, Norberto de Souza O, Ruiz DD: FReDD: Supporting mining strategies through a flexible-receptor docking database. Brazilian Symposium on Bioinformatics, Volume 5676 of LNBI/LNCS Advances in Bioinformatics and Computational Biology. 2009, Porto Alegre, Rio Grande do Sul, Brasil: Springer Berlin/Heidelberg, 143-146.
Winck A, Machado K, Norberto de Souza O, Ruiz DD: Supporting Intermolecular Interaction Analyses of Flexible-Receptor Docking Simulations. IADIS International Conference Applied Computing. 2010, Timisoara, Romania, 1-8.
Machado KS, Winck AT, Ruiz DD, Norberto de Souza O: Mining flexible-receptor docking experiments to select promising protein receptor snapshots. BMC Genomics. 2010, 11 (5): 1-13.
Machado KS, Winck AT, Ruiz DD, Cohen EML, Norberto de Souza O: Mining Flexible-receptor Docking Data. WIREs Data Mining and Knowledge Discovery. 2011, 1: 532-541. 10.1002/widm.46.
Winck AT, Machado KS, Ruiz DD, de Lima VLS: Association Rules to Identify Receptor and Ligand Structures through Named Entities Recognition. IEA/AIE (3). 2010, 119-128.
Witten IH, Frank E: Data Mining: Practical Machine Learning Tools and Techniques. 2011, Morgan Kaufmann Series in Data Management Systems, Morgan Kaufmann, 3rd edition
Morris GM, Goodsell DS, Halliday R, Huey R, Hart W, Belew RK, Olson AJ: Automated docking using a Lamarckian genetic algorithm and an empirical binding free energy function. J. Comput. Chem. 1998, 19 (14): 1639-1662. 10.1002/(SICI)1096-987X(19981115)19:14<1639::AID-JCC10>3.0.CO;2-B.
Machado K, Winck A, Ruiz DD, Norberto de Souza O: Discretization of Flexible-Receptor Docking Data. Brazilian Symposium on Bioinformatics, Volume 6268 of LNBI/LNCS Advances in Bioinformatics and Computational Biology. 2010, Buzios, Rio de Janeiro, Brasil: Springer Berlin/Heidelberg, 75-79.
Machado K, Winck A, Ruiz DD, Norberto de Souza O: Comparison of Discretization Methods of Flexible-Receptor Docking Data for Analyses by Decision Trees. IADIS International Conference Applied Computing. 2010, Timisoara, 223-229.
Jeffrey GA: An introduction to hydrogen bonding. 1997, Oxford University Press, New York
da Silveira CH, Pires DEV, Minardi RC, Ribeiro C, Veloso CJM, Lopes JCD, Meira W, Neshich G, Ramos CHI, Habesch R, Santoro MM: Protein cutoff scanning: A comparative analysis of cutoff dependent and cutoff free methods for prospecting contacts in proteins. Proteins: Structure, Function, and Bioinformatics. 2009, 74 (3): 727-743. 10.1002/prot.22187.
Hall MA, Smith LA: Feature subset selection: a correlation based filter approach. Proceedings of the 1997 International Conference on Neural Information Processing and Intelligent Information Systems: New Zealand. 1997, 855-858.
Quinlan JR: Learning with Continuous Classes. Proceedings of the 5th Australian Joint Conference on Artificial Intelligence. 1992, World Scientific, 343-348.
Barros RC, Winck AT, Machado KS, Basgalupp MP, Carvalho AC, Ruiz DD, Norberto de Souza O: Automatic design of decision-tree induction algorithms tailored to flexible-receptor docking data. BMC Bioinformatics. 2012, 13 (310): 1-14.
Siegel S, Castellan N: Nonparametric statistics for the behavioral sciences. 1988, McGraw-Hill, Inc, 2nd edition
Acknowledgements
We thank the reviewers for their comments and suggestions, which helped to improve the manuscript. This work was supported in part by grants (305984/2012-8, 559917/2010-4, 551209/2010-0) from the Brazilian National Research and Development Council (CNPq) to ONS. ONS is a CNPq Research Fellow. The PhD scholarships of ATW and KSM were funded by CNPq and the Brazilian Coordination of Improvement of Higher Education Personnel (CAPES). Research missions of ATW and DDAR to the University of Newcastle were funded by the European Commission FP7 Marie Curie IRSES grant, CILMI project.
Declarations
Publication of this article has been funded by the authors.
This article has been published as part of BMC Genomics Volume 14 Supplement 6, 2013: Proceedings of the International Conference of the Brazilian Association for Bioinformatics and Computational Biology (Xmeeting 2012). The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/14/S6.
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
ATW and KSM executed the preprocessing for the data mining experiments, performed all the data mining experiments, evaluated the models results and wrote the first draft of the article. DDAR helped to conceive the test cases and to evaluate the models. ONS helped to analyze the results and to write the final version of the article. All authors read and approved the final manuscript.
Rights and permissions
Open Access This article is published under license to BioMed Central Ltd. This Open Access article is distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver ( https://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Winck, A.T., Machado, K.S., de Souza, O.N. et al. Context-based preprocessing of molecular docking data. BMC Genomics 14 (Suppl 6), S6 (2013). https://doi.org/10.1186/1471-2164-14-S6-S6
DOI: https://doi.org/10.1186/1471-2164-14-S6-S6
Keywords
 Feature Selection
 Target Attribute
 Docking Simulation
 Docking Result
 Predictive Attribute