
BPLT+: A Bayesian-based personalized recommendation model for health care

Abstract

In this paper, we propose an Advanced Bayesian-based Personalized Laboratory Tests recommendation (BPLT+) model. Given a patient, we estimate whether a new laboratory test should belong to a "taken" or "not-taken" class. We use the Bayesian method to build a weighting function for a laboratory test and the given patient. A higher weight indicates that the laboratory test has a higher probability of being "taken" and a lower probability of being "not-taken" by the patient. For effectiveness and robustness, we further integrate several modified smoothing techniques into the model. To evaluate the BPLT+ model objectively, we propose a framework in which the data set is randomly split into a training set, a validation input set and a validation label set. A training matrix is generated from the training set. Then, instead of accessing the training set repeatedly, we utilize this training matrix to predict laboratory tests on the validation input set. Finally, the recommended ranking list is compared with the validation label set using our proposed metric CorrectRateX. We conduct experiments on real medical data, and the experimental results show the effectiveness of the proposed BPLT+ model.

Background

Large amounts of clinic laboratory test data are collected and stored every day. Therefore, there is an increasing need for analyzing and utilizing the laboratory test data. The problem we are working on in this paper is to recommend laboratory tests for given patients. Health care recommendation problems have drawn researchers' attention for years. However, there are not a lot of studies conducted on the clinic laboratory test recommendation problem.

The medical data we are working on contains several years of patients' laboratory test records. Figure 1 shows an example of the data format. Formally, the laboratory test prediction problem can be described as follows [1]: "Given a set of patients P = {p_1, p_2, ..., p_n} and a set of laboratory tests T = {test_1, test_2, ..., test_M}, each patient p_j has done tests test_{j,1}, ..., test_{j,k_j}. If a doctor would like to assign a new test for patient p_j, which test in T should be chosen?"

Figure 1

An example dataset. The format of the laboratory data sets is presented: the attributes from left to right are SDTE (SERVICE DATE), REQ# (REQUISITION NUMBER), PNUM (PATIENT HEALTH CARD#), PNAM (PATIENT NAME), PSEX (PATIENT SEX), BDTE (PATIENT DATE OF BIRTH), TSEQ (TEST SEQUENCE NUMBER), TEST (TEST CODE), DESC (TEST DESCRIPTION), RSLT (TEST RESULT), NORM (NORMAL RANGE), REXP (RESULT EXPECTED Y/N), EXRS (EXTENDED RESULT Y/N). The patient information in this table is fictitious for privacy reasons.

Computer systems have played an important role in health care for years [2-8]. Statistical algorithms [9-12] play an important role in investigating health care data. The studies in [13, 14] extract chemical keywords from a query patent by analyzing word frequency and each word's effect over the data collection. Bayesian learning is a widely used approach that shows good performance [15-19]. A semantic-based association rule mining approach is proposed to model medical query contexts in [20]. Using a novel classifier based on the Bayesian discriminant function, Raymer et al. [21] present a hybrid algorithm that employs feature selection and extraction to isolate salient features from large medical and other biological data sets. Martín and Pérez [22] analyze the robustness of the optimal action in a Bayesian decision making problem in the context of health care. The studies in [23, 24] examine the association between two words by simulating the impact of words in documents in the context of information retrieval. A probabilistic survival model is derived from survival analysis theory for measuring the aspect novelty of genomics data [25]. A mixture Markov model is proposed to investigate user navigation patterns so that a personalized recommendation system can be built for each user [26]. In our previous work [1], we proposed a laboratory test prediction model that objectively determines whether a laboratory test is associated with a patient. This paper is a significant extension of [1].

Smoothing [27] is a technique to create an approximating function that attempts to capture important patterns in the data, while leaving out noise or other fine-scale structures and rapid phenomena. Smoothing techniques have been used in many realms to improve accuracy [28]. Based on the basic Bayesian algorithm and smoothing techniques, we propose an Advanced Bayesian-based Personalized Laboratory Tests recommendation (BPLT+) model to investigate the correlations among laboratory tests for each patient. Evaluation is a crucial issue in the health care domain [29]. Some previous health care researchers perform evaluation via patient interaction [30] or statistics [31]. We present a metric CorrectRateX by employing the idea of Mean Average Precision (MAP) [32] from the Information Retrieval domain.

Four unique contributions are presented in this paper. First, we learn the associations among laboratory tests and make personalized recommendations to patients without human interaction. Second, we integrate modified smoothing technologies to improve the personalized recommendation model and propose the BPLT+ model. Third, we propose a framework to randomly generate a training data set, a validation input set and a validation label set. Fourth, we use an objective evaluation metric for personalized recommendation systems without patient interaction.

Methods

Bayesian-Based personalized laboratory tests recommendation (BPLT) model

Here we assume that the laboratory tests for a patient have associations with each other. For instance, if a patient is suspected to have diabetes, the doctor will usually assign both a Hemoglobin test and a Glucose Fasting test for this patient. We can see that there exists an association between Hemoglobin and Glucose Fasting with respect to some hidden information, diabetes in this case. Conversely, if a patient is assigned a Hemoglobin test, then it is very likely that this patient should also take a Glucose Fasting test. In this section, we build a model for learning the associations among laboratory tests, inferring the associations between patients and laboratory tests, and thereby recommending new laboratory tests to patients. We regard the test recommendation problem as a special classification problem, where a test belongs to either a "taken" or a "not-taken" class. We use a Bayesian classifier as our basic classifier and modify it into a personalized ranking model.

Basic concept: Bayesian classifier

A classification problem is the following [33]: given a set of training instances, each described by a set of n attributes and each belonging to exactly one of a certain number of possible classes, learn to classify new, unseen objects. In addition, each attribute has a fixed number of possible values. We use the naive Bayesian classifier as our basic classifier in this paper, since it directly evaluates the probability of taking a test and the conditional probability between two tests. Moreover, naive Bayes is easy to construct and has surprisingly good performance in classification, even though the conditional independence assumption is rarely true in real-world applications [34]. The probability model for a classifier is the conditional model

\Pr(C \mid F_1, \ldots, F_n)
(1)

where F_1, ..., F_n are attributes, and C is a class variable. By Bayes' rule, this equals

\frac{\Pr(C) \Pr(F_1, \ldots, F_n \mid C)}{\Pr(F_1, \ldots, F_n)}
(2)

The denominator is effectively constant, and the numerator is equivalent to the joint probability model

\Pr(C, F_1, \ldots, F_n) = \Pr(C) \Pr(F_1 \mid C) \Pr(F_2 \mid C, F_1) \Pr(F_3 \mid C, F_1, F_2) \cdots \Pr(F_n \mid C, F_1, \ldots, F_{n-1})

The naive Bayesian model assumes that the features are conditionally independent:

\Pr(F_i \mid C, F_j) = \Pr(F_i \mid C), \quad \text{for } i \neq j

Therefore, the probability of class C given features F_1, ..., F_n is

\Pr(C \mid F_1, \ldots, F_n) = A \Pr(C) \prod_{i=1}^{n} \Pr(F_i \mid C)
(3)

where A = \frac{1}{\Pr(F_1, \ldots, F_n)} is a constant.
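As an illustration, the classification rule of Formula (3) can be sketched in a few lines of Python for binary features. The tiny training set, class names, and the eps guard against zero counts are illustrative assumptions, not part of the paper:

```python
# Minimal naive Bayes sketch for Formula (3) with binary features.
# The training rows and class names are illustrative, not from the paper.
train = [
    ((1, 0, 1), "taken"),
    ((1, 1, 1), "taken"),
    ((0, 0, 1), "not-taken"),
    ((0, 1, 0), "not-taken"),
]

def naive_bayes_scores(x, data, classes=("taken", "not-taken"), eps=1e-9):
    """Return the unnormalized score Pr(C) * prod_i Pr(F_i | C) per class."""
    scores = {}
    for c in classes:
        rows = [f for f, cl in data if cl == c]
        score = len(rows) / len(data)  # the prior Pr(C)
        for i, v in enumerate(x):
            # empirical Pr(F_i = v | C); eps is a crude guard against zero counts
            score *= (sum(1 for f in rows if f[i] == v) / len(rows)) or eps
        scores[c] = score
    return scores

scores = naive_bayes_scores((1, 0, 1), train)
predicted = max(scores, key=scores.get)
```

For x = (1, 0, 1), the "taken" score works out to 0.5 × 1.0 × 0.5 × 1.0 = 0.25, which dominates the "not-taken" score, so the predicted class is "taken".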

The weighting function of BPLT model

In this section, we describe the Bayesian-based Personalized Laboratory Tests recommendation (BPLT) model, which was proposed in our previous work [1]; more details are given in this paper. The purpose of the BPLT model is to classify laboratory tests for individual patients according to their personal conditions. In the real world, it is often easy to obtain patients' previous laboratory test information. Therefore, the BPLT model recommends additional new laboratory tests to patients, given the previous laboratory tests that the patients have taken.

Suppose we have a set of M laboratory tests T = {test_1, test_2, ..., test_M}, and a patient p_j who has taken tests T_j = {test_{j,1}, ..., test_{j,k_j}}, where test_{j,i} ∈ T for all 1 ≤ i ≤ k_j. We denote the events that the tests in T are taken by p_j as F_{j,1}, F_{j,2}, ..., F_{j,M}. For example, if we have 7 tests in T and p_j has taken test_3, test_5 and test_7, this could be represented as (F_{j,1}, F_{j,2}, ..., F_{j,7}) = (0, 0, 1, 0, 1, 0, 1). The Bayesian classifier is employed to evaluate the association between p_j and a new test test_0, where test_0 ∈ T and test_0 ∉ T_j. We use F_{j,0} to represent the event that p_j should take test_0, and F^c_{j,0} to represent the event that p_j should not take test_0. By Formula (3), the probability of F_{j,0} given F_{j,1}, F_{j,2}, ..., F_{j,M} is

\Pr(F_{j,0} \mid F_{j,1}, F_{j,2}, \ldots, F_{j,M}) \propto \Pr(F_{j,0}) \prod_{i=1}^{M} \Pr(F_{j,i} \mid F_{j,0})

The probability of F^c_{j,0} given F_{j,1}, F_{j,2}, ..., F_{j,M} is

\Pr(F^c_{j,0} \mid F_{j,1}, F_{j,2}, \ldots, F_{j,M}) \propto \Pr(F^c_{j,0}) \prod_{i=1}^{M} \Pr(F_{j,i} \mid F^c_{j,0})

In the BPLT model, we reward tests with a high probability of being "taken" and a low probability of being "not-taken". The correlation between a new test test_0 and a given patient p_j is shown in Definition 1 [1].

Definition 1 The correlation between a new test test_0 and a given patient p_j is defined as the log of the probability that p_j should take test_0 divided by the probability that p_j should not take test_0, given F_{j,1}, F_{j,2}, ..., F_{j,M}.

corr(test_0, p_j) = \log \frac{\Pr(F_{j,0} \mid F_{j,1}, F_{j,2}, \ldots, F_{j,M})}{\Pr(F^c_{j,0} \mid F_{j,1}, F_{j,2}, \ldots, F_{j,M})}
(4)

We can see that a higher value of corr(test_0, p_j) indicates that test_0 is more strongly associated with p_j. The calculation of corr(test_0, p_j) can be further simplified as follows

corr(test_0, p_j) = \log \Pr(F_{j,0} \mid F_{j,1}, \ldots, F_{j,M}) - \log \Pr(F^c_{j,0} \mid F_{j,1}, \ldots, F_{j,M})
= \log \Pr(F_{j,0}) \prod_{i=1}^{M} \Pr(F_{j,i} \mid F_{j,0}) - \log \Pr(F^c_{j,0}) \prod_{i=1}^{M} \Pr(F_{j,i} \mid F^c_{j,0})
= \log \frac{\Pr(F_{j,0})}{\Pr(F^c_{j,0})} + \sum_{i=1}^{M} \log \frac{\Pr(F_{j,i} \mid F_{j,0})}{\Pr(F_{j,i} \mid F^c_{j,0})}
(5)

Moreover, a test belongs to either the "taken" class or the "not-taken" class. Thus, the following two formulas hold.

\Pr(F_{j,0}) + \Pr(F^c_{j,0}) = 1
\Pr(F_{j,i} \mid F_{j,0}) \Pr(F_{j,0}) + \Pr(F_{j,i} \mid F^c_{j,0}) \Pr(F^c_{j,0}) = \Pr(F_{j,i})

from which we can obtain \Pr(F^c_{j,0}) and \Pr(F_{j,i} \mid F^c_{j,0}):

\Pr(F^c_{j,0}) = 1 - \Pr(F_{j,0})
\Pr(F_{j,i} \mid F^c_{j,0}) = \frac{\Pr(F_{j,i}) - \Pr(F_{j,i} \mid F_{j,0}) \Pr(F_{j,0})}{1 - \Pr(F_{j,0})}

Thus \Pr(F^c_{j,0}) and \Pr(F_{j,i} \mid F^c_{j,0}) in (5) can be eliminated from corr(test_0, p_j), as shown below

\log \frac{\Pr(F_{j,0})}{1 - \Pr(F_{j,0})} + \sum_{i=1}^{M} \log \frac{\Pr(F_{j,i} \mid F_{j,0}) \, (1 - \Pr(F_{j,0}))}{\Pr(F_{j,i}) - \Pr(F_{j,i} \mid F_{j,0}) \Pr(F_{j,0})}

The joint probability that patient p_j takes both test_i and test_0 is

\Pr(F_{j,i}, F_{j,0}) = \Pr(F_{j,i} \mid F_{j,0}) \Pr(F_{j,0})

The correlation between test_0 and p_j then becomes

corr(test_0, p_j) = \log \frac{\Pr(F_{j,0})}{1 - \Pr(F_{j,0})} + \sum_{i=1}^{M} \log \frac{\Pr(F_{j,i}, F_{j,0}) \, (1 - \Pr(F_{j,0}))}{\Pr(F_{j,0}) \, (\Pr(F_{j,i}) - \Pr(F_{j,i}, F_{j,0}))}
= (k - 1) \log \frac{1 - \Pr(F_{j,0})}{\Pr(F_{j,0})} + \sum_{i=1}^{M} \log \frac{\Pr(F_{j,i} \mid F_{j,0})}{\Pr(F_{j,i}) - \Pr(F_{j,i} \mid F_{j,0})}

which leads to the following Definition 2 [1].

Definition 2 The weighting function of a laboratory test test_0 for a patient p_j is the simplified correlation between test_0 and p_j

w(test_0, p_j) = (k - 1) \log \frac{1 - \alpha}{\alpha} + \sum_{i=1}^{M} \log \frac{\beta_{j,i}}{\gamma_{j,i} - \beta_{j,i}}
(6)

where

\alpha = \Pr(F_{j,0}) = \frac{\text{number of patients who have taken } test_0}{\text{number of patients}}
\gamma_{j,i} = \Pr(F_{j,i}) = \frac{\text{number of patients for whom } F_{j,i} \text{ holds}}{\text{number of patients}}
\beta_{j,i} = \Pr(F_{j,i} \mid F_{j,0}) = \frac{\Pr(F_{j,i}, F_{j,0})}{\Pr(F_{j,0})} = \frac{1}{\alpha} \cdot \frac{\text{number of patients for whom both } F_{j,0} \text{ and } F_{j,i} \text{ hold}}{\text{number of patients}}

The new laboratory tests are ranked in a list according to w(test_0, p_j) for a given patient p_j. In a later section, we present the evaluation environment for this laboratory test ranking list.
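A minimal sketch of computing the weight in Formula (6) from raw counts, under assumptions the paper leaves implicit: the sum runs over the tests the patient has already taken, k is the number of those tests, and the count dictionaries (counts, pair_counts) are hypothetical data structures introduced here for illustration:

```python
import math

def weight(test0, patient_tests, counts, pair_counts, n_patients):
    """Sketch of Formula (6): w = (k-1)*log((1-a)/a) + sum_i log(b/(g-b)).
    counts[t]: patients who took t; pair_counts[(s, t)]: patients who took both."""
    alpha = counts[test0] / n_patients                  # alpha = Pr(F_{j,0})
    k = len(patient_tests)                              # assumed meaning of k
    w = (k - 1) * math.log((1 - alpha) / alpha)
    for t in patient_tests:
        gamma = counts[t] / n_patients                  # gamma = Pr(F_{j,i})
        beta = pair_counts.get((test0, t), 0) / counts[test0]  # beta = Pr(F_{j,i}|F_{j,0})
        # log is undefined when beta = 0 or gamma = beta: the motivation for BPLT+
        w += math.log(beta / (gamma - beta))
    return w

# Toy counts: 10 patients, 4 took A, 6 took B, 2 took both.
w_AB = weight("A", ["B"], {"A": 4, "B": 6}, {("A", "B"): 2}, 10)
```

With these toy counts the first term vanishes (k = 1) and w_AB = log(0.5 / 0.1) = log 5.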

An advanced model: BPLT+

To obtain a more robust and better-performing model, we further propose an advanced model, BPLT+, which improves the BPLT model using several smoothing techniques. There are two reasons for smoothing BPLT. One is that smoothing is a way to deal with noise within the data. The other is to avoid mathematically undefined values. When the test test_0 has not been observed in previous visits, meaning α = 0, the first part of formula (6) is undefined. Likewise, when the joint frequency of two laboratory tests is zero, meaning β_{j,i} = 0, the second part of (6) is undefined. Therefore, we introduce smoothing technologies to further improve the BPLT model.

Smoothing techniques

In statistics, smoothing [27] is a technique to create an approximating function that attempts to capture important patterns in the data, while leaving out noise or other fine-scale structures/rapid phenomena. The main purpose of smoothing in this paper is to assign a non-zero probability to the unseen tests and improve the accuracy of test probability estimation in general.

The smoothing techniques are discussed based on the following definition of a conditional probability [28].

\Pr(t \mid p) = \frac{c(t; p)}{\sum_{t' \in T} c(t'; p)}
(7)

where c(t; p) is the count of patient p taking test t. Since we have defined a ranking problem, which is similar to problems in Information Retrieval (IR), we use several smoothing methods that are widely used in IR language models. The general form of a smoothed model [35] is assumed to be the following:

\Pr(t \mid p) = \begin{cases} \Pr_t(t \mid p) & \text{if test } t \text{ is observed} \\ \Pr(t \mid C) & \text{otherwise} \end{cases}
(8)

where Pr_t(t|p) is the smoothed probability of a test t given the patient's existing tests, and Pr(t|C) is the probability of a test t given the whole data set.

A smoothing method may be as simple as adding an extra count to every test, which is called additive or Laplace smoothing, or more sophisticated, as in Katz smoothing, where tests with different counts are treated differently. Three representative methods that are popular and effective are:

  • The Jelinek-Mercer method

    \Pr_\lambda(t \mid p) = (1 - \lambda) \Pr(t \mid p) + \lambda \Pr(t \mid C)
    (9)

where λ is a balancing parameter that ranges from 0 to 1.

  • Bayesian Smoothing using Dirichlet Priors

    \Pr_{\mu_0}(t \mid p) = \frac{c(t; p) + \mu_0 \Pr(t \mid C)}{\sum_{t' \in T} c(t'; p) + \mu_0}
    (10)

where µ0 > 0 is a balancing parameter. The Laplace method is a special case of this technique.

  • Absolute Discounting

    \Pr_\delta(t \mid p) = \frac{\max(c(t; p) - \delta, 0)}{\sum_{t' \in T} c(t'; p)} + \sigma \Pr(t \mid C)
    (11)

where δ ∈ [0, 1] is a discount constant and σ = δ|p|_u / |p|, so that all probabilities sum to one. Here |p|_u is the number of unique tests taken by patient p, and |p| is the patient's total test count.
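The three smoothing formulas (9)-(11) translate directly into code. This sketch takes the already-computed probabilities and counts as inputs; the function and argument names are ours, not from the paper:

```python
def jelinek_mercer(p_tp, p_tc, lam):
    """Formula (9): interpolate the patient model p(t|p) with the collection model p(t|C)."""
    return (1 - lam) * p_tp + lam * p_tc

def dirichlet_prior(count_tp, total_p, p_tc, mu0):
    """Formula (10): Bayesian smoothing with a Dirichlet prior of weight mu0."""
    return (count_tp + mu0 * p_tc) / (total_p + mu0)

def absolute_discount(count_tp, total_p, p_tc, delta, n_unique):
    """Formula (11): discount each seen count by delta; sigma = delta * |p|_u / |p|."""
    sigma = delta * n_unique / total_p
    return max(count_tp - delta, 0) / total_p + sigma * p_tc
```

For instance, with p(t|p) = 0.5, p(t|C) = 0.1 and λ = 0.2, Jelinek-Mercer gives 0.8 × 0.5 + 0.2 × 0.1 = 0.42.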

BPLT+ with smoothing techniques

There are two parts in formula (6) that need smoothing. The first is the conditional probability β_{j,i} = Pr(F_{j,i} | F_{j,0}). Its smoothed forms are as follows:

  • BPLT+ with Jelinek-Mercer

    \beta^\lambda_{j,i} = (1 - \lambda) \beta_{j,i} + \lambda \gamma_{j,i}
    (12)
  • BPLT+ with Dirichlet priors

    \beta^\mu_{j,i} = \frac{\beta_{j,i} + \mu \gamma_{j,i}}{1 + \mu}
    (13)
  • BPLT+ with absolute discounting

    \beta^\delta_{j,i} = \frac{\max(c(t; p) - \delta, 0)}{\sum_{t' \in T} c(t'; p)} + \delta \gamma_{j,i}
    (14)

In Jelinek-Mercer BPLT+ and Absolute Discounting BPLT+, we use the existing smoothing methods; the smoothing parameters λ and δ are within the range [0, 1]. In Dirichlet Priors BPLT+, we modify the Dirichlet smoothing technique by dividing both the numerator and the denominator in (10) by \sum_{t' \in T} c(t'; p), which normalizes the parameter µ to the range [0, 1], where \mu = \frac{\mu_0}{\sum_{t' \in T} c(t'; p)}.

The other part of formula (6) that needs smoothing is \log \frac{\alpha}{1 - \alpha}, a simple ratio that can be smoothed via Laplace smoothing as

\log \frac{\alpha + \theta}{1 - \alpha + \theta}
(15)

where θ is a tuning parameter that ranges from 0 to 1.
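Formulas (12), (13) and (15) are simple enough to sketch directly; note how the θ-smoothed term stays finite even when α = 0, which is exactly the case the unsmoothed log ratio cannot handle (function names are ours):

```python
import math

def beta_jelinek_mercer(beta, gamma, lam):
    """Formula (12): Jelinek-Mercer smoothed beta."""
    return (1 - lam) * beta + lam * gamma

def beta_dirichlet(beta, gamma, mu):
    """Formula (13): Dirichlet smoothing with the normalized parameter mu."""
    return (beta + mu * gamma) / (1 + mu)

def laplace_log_term(alpha, theta):
    """Formula (15): Laplace-smoothed log ratio; defined even for alpha = 0."""
    return math.log((alpha + theta) / (1 - alpha + theta))

# alpha = 0 no longer raises a math domain error: log(0.5 / 1.5) = log(1/3)
finite_at_zero = laplace_log_term(0.0, 0.5)
```
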

Evaluation environments

Datasets

The datasets in our experiment are obtained from Alpha Global IT [1, 36]. Alpha Corporate Group provides laboratory, medical clinic, commercial electronic medical record and practice management software. The data set contains 78 months of patients' laboratory test results. Our experiments use 6 months of results, containing 1,048,575 patient records, as a case study. Thousands of patients and more than 400 laboratory tests are included in our experiments. The data format is the same as the example shown in Figure 1. Our data set contains real patient information, such as health card ID, age, gender, date of visit, laboratory test ID and laboratory test results. We use only the patient ID and laboratory test ID attributes in this paper, and analyze the associations among the laboratory tests. In our future work, we will incorporate more attributes into the laboratory recommendation model.

Validation data and measure

To evaluate BPLT+ models objectively, we divide the data set into three components: a training set, a validation input set, and a validation label set. The data set is first randomly split into a training set and a validation set. In this step, we split based on patients and do not split the records of a single patient. Then, for the validation set, we randomly remove one test t* from each patient p_j and store t* in the validation label set. The ranked list returned by BPLT+ is compared with t* for each patient. To measure this comparison and thus evaluate the effectiveness of BPLT+, we use the following CorrectRateX [1]. Suppose the returned laboratory ranking list is L = {t_{1,j}, ..., t_{l,j}}; CorrectRateX validates whether t* appears among the top-ranked tests. The measure is adapted from the Mean Average Precision (MAP) [32] evaluation metric.
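The hold-one-out construction of the validation input and label sets can be sketched as follows; the dictionary layout and function name are illustrative assumptions, not from the paper:

```python
import random

def make_validation(patients, seed=0):
    """Hold out one random test t* per patient as the label (sketch).
    patients: dict patient_id -> set of test IDs (assumed layout)."""
    rng = random.Random(seed)
    val_input, val_label = {}, {}
    for pid, tests in patients.items():
        t_star = rng.choice(sorted(tests))      # the removed test t*
        val_input[pid] = set(tests) - {t_star}  # what the model sees
        val_label[pid] = t_star                 # what it must recover
    return val_input, val_label

val_input, val_label = make_validation(
    {"p1": {"test1", "test2", "test3"}, "p2": {"test4", "test5"}})
```

By construction, each patient's label test is disjoint from that patient's validation input, and together they reconstruct the original record.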

Definition 3 CorrectRateX evaluates the accuracy of a laboratory test prediction system. It is the number of patients whose desired (gold standard) test matches one of the top X tests generated by the system, divided by the total number of patients.

\mathrm{CorrectRate}_X = \frac{\sum_{j=1}^{n} \mathrm{TOP}_{j,X}}{n}
(16)

where

\mathrm{TOP}_{j,X} = \begin{cases} 1 & \text{if } t^* \text{ matches a test in } \{t_{1,j}, \ldots, t_{X,j}\} \\ 0 & \text{otherwise} \end{cases}

where n is the number of patients, and X is a parameter indicating how many top tests are compared with the gold standard test t*; X is set to 1 or 3 in this paper.

We present an example of how CorrectRateX evaluates the model in Table 1. Suppose the laboratory test set includes 200 tests and there are 5 patients in the validation set. As introduced above, the BPLT+ model returns a ranked list for each patient. Here ">" indicates that the weight of the left-side laboratory test is higher than the weight of the right-side laboratory test. In our example, 2 out of 5 patients have the desired test t* ranked in the top 1 position of the list, so CorrectRate1 equals 0.4, and 4 out of 5 patients have t* appearing within the top 3 positions of the returned ranking list, so CorrectRate3 equals 0.8. Since the top 3 positions include the top 1 position, the following statement always holds: CorrectRate1 ≤ CorrectRate3.

Table 1 An example of CorrectRateX
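The 2-out-of-5 and 4-out-of-5 example above can be reproduced with a short metric function; the ranked lists below are toy stand-ins for the system's output:

```python
def correct_rate(ranked_lists, labels, X):
    """Formula (16): fraction of patients whose held-out t* is in the top X."""
    hits = sum(1 for ranked, t_star in zip(ranked_lists, labels)
               if t_star in ranked[:X])
    return hits / len(labels)

# Five toy patients with t* = "t42"; hits at ranks 1, 2, 1, 3 and one miss.
ranked = [["t42", "t7", "t9"],
          ["t3", "t42", "t9"],
          ["t42", "t1", "t2"],
          ["t5", "t6", "t42"],
          ["t8", "t9", "t10"]]
labels = ["t42"] * 5
```

Here correct_rate(ranked, labels, 1) gives 0.4 and correct_rate(ranked, labels, 3) gives 0.8, matching the CorrectRate1 and CorrectRate3 values in the example.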

BPLT+ System Framework

The framework of the BPLT+ model is shown in Figure 2. The data set in this framework is abstracted to contain only patient IDs and laboratory test IDs. The procedures of the proposed framework are described as follows.

Figure 2

BPLT+ System Framework. The procedures for processing the laboratory data and testing the BPLT+ model are shown: (1) the rectangles represent the data sets; (2) the rounded rectangles present the implemented procedures; (3) the ovals show the personalized laboratory model; (4) the lines with arrows determine the directions through the framework.

Split: First the data set is randomly split into a training set and a validation set.

Randomly remove a test as label: Since it is hard to objectively evaluate the performance of the BPLT+ model, we further randomly remove one test for each patient visit in the validation set. These removed tests are regarded as the labels of the validation input set. Our ultimate goal is to recommend the missing test for a patient's visit.

Build training matrix: To avoid repeatedly calculating the frequency of a test and the joint frequency of two tests, we build a training matrix from the training data. This training matrix contains the frequencies of co-occurrence of pairs of laboratory tests. For example, if a patient in the training data did test_1 and test_2 together, we add 1 to both F_{12} and F_{21}. We can see that the training matrix is a symmetric matrix.
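The symmetric co-occurrence matrix can be built in one pass over the training records; a sparse dictionary stands in for the matrix here, and the data layout is an assumption:

```python
from collections import defaultdict
from itertools import combinations

def build_training_matrix(patient_tests):
    """Count co-occurrences of test pairs; F[(s, t)] == F[(t, s)] (symmetric).
    patient_tests: dict patient_id -> set of test IDs (assumed layout)."""
    F = defaultdict(int)
    for tests in patient_tests.values():
        for s, t in combinations(sorted(tests), 2):
            F[(s, t)] += 1
            F[(t, s)] += 1  # mirror the count to keep the matrix symmetric
    return F

F = build_training_matrix({"p1": {"test1", "test2"},
                           "p2": {"test1", "test2", "test3"}})
```

In this toy run, test1 and test2 co-occur for both patients, so F[("test1", "test2")] and its mirror both equal 2.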

BPLT+ model: The correlation between a given test_0 and a patient is calculated based on formula (6).

Evaluation via CorrectRateX: Finally, the evaluation criterion CorrectRateX determines whether the model made the correct recommendations.

Results

We first show the overall performance under different training-validation proportions in Table 2 [1]. We randomly take 40%, 50% and 60% of the raw data set as training data and keep the rest as validation data. In general, the BPLT+ model performs better on a larger training data set. This is because a larger training data set contains more information, so more knowledge can be learned. With the development of computer technology, larger amounts of medical data will become available in practice. Therefore, we use 60% of the data as training data in the rest of this paper. As discussed above, CorrectRate3 is always at least as high as CorrectRate1. In general, the BPLT+ model has promising performance, with an accuracy of 0.7074 for CorrectRate1 and an accuracy of 0.7840 for CorrectRate3.

Table 2 Performance

Then we investigate in detail how the smoothing parameters affect effectiveness. We first consider smoothing β_{j,i} only. Three smoothing technologies are utilized to smooth β_{j,i}: Jelinek-Mercer BPLT+, Dirichlet Priors BPLT+ and Absolute Discounting BPLT+, with the corresponding parameters λ, µ, δ ∈ [0, 1]. We conduct experiments on these three methods individually. The changes of CorrectRate1 and CorrectRate3 with respect to the parameters are shown in Figure 3, Figure 4, and Figure 5. We can see from the figures that the curve of CorrectRate1 is always below the curve of CorrectRate3, which is consistent with Definition 3. As the parameters increase from 0.1 to 1, both CorrectRate1 and CorrectRate3 rise at first, owing to the incorporation of the smoothing portion. After reaching their maximum values, CorrectRate1 and CorrectRate3 decline, since the weighting tends to become more uniform when too much smoothing is incorporated. All the smoothing parameters achieve their best performance at the value 0.2. Comparing the three methods, Jelinek-Mercer BPLT+ obtains the best performance on both CorrectRate1 and CorrectRate3, at 0.5569 and 0.6167 respectively. In terms of average values, Dirichlet Priors BPLT+'s average performance on CorrectRate3 is better than the other two, and Jelinek-Mercer BPLT+'s average performance on CorrectRate1 is the best.

Figure 3

Parameter Sensitivity of λ in Jelinek-Mercer BPLT+. The influence of parameter λ is investigated: (1) the stars represent the performance of Jelinek-Mercer BPLT+ under the evaluation metric CorrectRate1; (2) the circles represent the performance of Jelinek-Mercer BPLT+ under the evaluation metric CorrectRate3; (3) CorrectRate3 is always higher than CorrectRate1; (4) Jelinek-Mercer BPLT+ achieves its best performance when λ = 0.2.

Figure 4

Parameter Sensitivity of µ in Dirichlet Priors BPLT+. The influence of parameter µ is studied: (1) the stars represent the performance of Dirichlet Priors BPLT+ under the evaluation metric CorrectRate1; (2) the circles represent the performance of Dirichlet Priors BPLT+ under the evaluation metric CorrectRate3; (3) Dirichlet Priors BPLT+ achieves its best performance when µ = 0.2.

Figure 5

Parameter Sensitivity of δ in Absolute Discounting BPLT+. The influence of parameter δ is investigated: (1) the stars represent the performance of Absolute Discounting BPLT+ under the evaluation metric CorrectRate1; (2) the circles represent the performance of Absolute Discounting BPLT+ under the evaluation metric CorrectRate3; (3) Absolute Discounting BPLT+ achieves its best performance when δ = 0.2.

We further discuss smoothing the other part of (6), where the Laplace smoothing parameter is θ. As discussed above, Jelinek-Mercer BPLT+ has the best performance on both CorrectRate1 and CorrectRate3. We therefore investigate the sensitivity of θ while fixing Jelinek-Mercer BPLT+ with λ = 0.2. The results are shown in Figure 6. We can see that CorrectRate1 increases as θ increases, while CorrectRate3 decreases slightly and then increases. Both reach their maximum and tend to be stable when θ is greater than 0.5.

Figure 6

Parameter Sensitivity of θ. The influence of parameter θ is presented: (1) we use the best smoothing technique for the first part in Formula 6, which is Jelinek-Mercer BPLT+; (2) the smoothing parameter λ is set to its optimal value; (3) the stars represent the results of Jelinek-Mercer BPLT+ under evaluation metric CorrectRate1; (4) the circles represent the results of Jelinek-Mercer BPLT+ under evaluation metric CorrectRate3; (5) both metrics reach their maximum and tend to be stable when θ is greater than 0.5.

Conclusions and future work

An Advanced Bayesian-based Personalized Laboratory Tests recommendation (BPLT+) model is proposed in this paper. Based on the assumption that hidden associations can exist among laboratory tests, we employ a Bayesian approach to build a weighting function for scoring the correlation between a new laboratory test and a patient. To obtain a more robust and better-performing model, we incorporate several enhanced smoothing technologies into the BPLT+ model. The main purpose of smoothing in this paper is to assign a non-zero probability to unseen laboratory tests and to improve the accuracy of test probability estimation. We integrate existing smoothing techniques into the BPLT+ model. In particular, we use three techniques, the Jelinek-Mercer, Dirichlet Priors and Absolute Discounting approaches, to smooth the conditional probability of observing a patient taking an existing test when a new test test_0 is given (Formulas 12-14). We also use the Laplace method to smooth the log function in the BPLT+ model (Formula 15). We conducted experiments to examine the performance of the BPLT+ model and the sensitivity of the smoothing parameters. We find that BPLT+ is able to make accurate recommendations under proper smoothing parameters.

Further, we propose a novel framework, shown in Figure 2, for effectively implementing the BPLT+ model and objectively testing personalized recommendation systems without human interaction. Based on real patients' laboratory test data, we randomly generate a training data set, a validation input set and a validation label set. A training matrix containing the laboratory test statistics is calculated from the training data set and stored. For new patients (the validation input set), instead of processing the original training set, we utilize this training matrix to predict laboratory tests on the validation input set, and compare the ranking results with the validation label set.

There are a few future directions for this research. As we can see from the data format in Figure 1, we have not made use of all the attributes. In the future, we would like to conduct a comprehensive investigation of patients' profiles. For example, we can cluster the patients into groups and investigate the similarities of patients in the same group. We can also analyze the associations among laboratory test results and thereby further enhance our proposed personalized recommendation model. Moreover, we look forward to testing our proposed models in more real applications.

References

  1. 1.

    Zhao J, Huang JX, Hu X, Kurian J, Melek W: A Bayesian-based prediction model for personalized medical health care. Bioinformatics and Biomedicine (BIBM). 2012, 1-4. 10.1109/BIBM.2012.6392623. IEEE International Conference on: 4-7 October 2012

    Google Scholar 

  2. 2.

    Bates D, Cohen M, Leape L, Overhage J, Shabot M, Sheridan T: Reducing the frequency of errors in medicine using information technology. Journal of the American Medical Informatics Association. 2001, 8 (4): 299-308. 10.1136/jamia.2001.0080299.

    PubMed Central  CAS  Article  PubMed  Google Scholar 

  3. 3.

    Ogiela L, Tadeusiewicz R, Ogiela M: Cognitive techniques in medical information systems. Computers in Biology and Medicine. 2008, 38 (4): 501-507. 10.1016/j.compbiomed.2008.01.017.

    Article  PubMed  Google Scholar 

  4. 4.

    Shortliffe E, Cimino J: Biomedical informatics: computer applications in health care and biomedicine. 2006, Springer

    Book  Google Scholar 

  5. 5.

    Melski J, Geer D, Bleich H: Medical information storage and retrieval using preprocessed variables. Computers and Biomedical Research, An International Journal. 1978, 11 (6): 613-10.1016/0010-4809(78)90038-1.

    CAS  Article  PubMed  Google Scholar 

  6. 6.

    Thoma G, Suthasinekul S, Walker F, Cookson J, Rashidian M: A prototype system for the electronic storage and retrieval of document images. ACM Transactions on Information Systems. 1985, 3 (3): 279-291. 10.1145/4229.4232.

    Article  Google Scholar 

  7. 7.

    Frick S, Uehlinger D, Zenklusen R: Medical futility: Predicting outcome of intensive care unit patients by nurses and doctors-A prospective comparative study*. Critical Care Medicine. 2003, 31 (2): 456-461. 10.1097/01.CCM.0000049945.69373.7C.

    Article  PubMed  Google Scholar 

  8. 8.

    Wu W, Bui A, Batalin M, Au L, Binney J, Kaiser W: MEDIC: Medical embedded device for individualized care. Artificial Intelligence in Medicine. 2008, 42 (2): 137-152. 10.1016/j.artmed.2007.11.006.

    Article  PubMed  Google Scholar 

  9. 9.

    Kajíc V, Esmaeelpour M, Považay B, Marshall D, Rosin P, Drexler W: Automated choroidal segmentation of 1060 nm OCT in healthy and pathologic eyes using a statistical model. Biomedical Optics Express. 2012, 3 (1): 86-103. 10.1364/BOE.3.000086.

    PubMed Central  Article  PubMed  Google Scholar 

  10. Kokol P, Pohorec S, Štiglic G, Podgorelec V: Evolutionary design of decision trees for medical application. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery. 2012, 2 (3): 237-254. 10.1002/widm.1056.

  11. Pepe M: The statistical evaluation of medical tests for classification and prediction. 2004, Oxford University Press, USA.

  12. Rohian H, An A, Zhao J, Huang X: Discovering temporal associations among significant changes in gene expression. Proceedings of IEEE International Conference on Bioinformatics and Biomedicine, IEEE. 2009, 419-423.

  13. Lupu M, Huang XJ, Zhu J: TREC Chemical Information Retrieval - An Initial Evaluation Effort for Chemical IR Systems. World Patent Information Journal. 2011, 33 (3): 248-256. 10.1016/j.wpi.2011.03.002.

  14. Zhao J, Huang X, Ye Z, Zhu J: York University at TREC 2009: Chemical Track. Proceedings of the 18th Text REtrieval Conference. 2009.

  15. Bernardo J, Smith A: Bayesian theory. Measurement Science and Technology. 2001, 12: 221-222.

  16. Chen J, Huang H, Tian F, Tian S: A selective bayes classifier for classifying incomplete data based on gain ratio. Knowledge-Based Systems. 2008, 21 (7): 530-534. 10.1016/j.knosys.2008.03.013.

  17. Clèries R, Ribes J, Buxo M, Ameijide A, Marcos-Gragera R, Galceran J, Martínez J, Yasui Y: Bayesian approach to predicting cancer incidence for an area without cancer registration by using cancer incidence data from nearby areas. Statistics in Medicine. 2012.

  18. Huang X, Hu Q: A Bayesian Learning Approach to Promoting Diversity in Ranking for Biomedical Information Retrieval. Proceedings of the 32nd Annual International Conference on Research and Development in Information Retrieval. 2009, 19-23.

  19. Liechty J, Liechty M, Muller P: Bayesian correlation estimation. Biometrika. 2004, 91: 1-10.1093/biomet/91.1.1.

  20. Babashzadeh A, Daoud M, Huang J: Using semantic-based association rule mining for improving clinical text retrieval. Health Information Science. 2013, 186-197.

  21. Raymer M, Doom T, Kuhn L, Punch W: Knowledge discovery in medical and biological datasets using a hybrid bayes classifier/evolutionary algorithm. IEEE Transactions on Systems, Man, and Cybernetics, Part B. 2003, 33 (5): 802-813. 10.1109/TSMCB.2003.816922.

  22. Martín J, Pérez C, Muller P: Bayesian robustness for decision making problems: Applications in medical contexts. International Journal of Approximate Reasoning. 2009, 50 (2): 315-323. 10.1016/j.ijar.2008.03.017.

  23. Hu Q, Huang X: Passage Extraction and Result Combination for Genomics Information Retrieval. Journal of Intelligent Information Systems. 2010, 34 (3): 249-274. 10.1007/s10844-009-0097-4.

  24. Zhao J, Huang JX, He B: CRTER: using cross terms to enhance probabilistic information retrieval. Proceedings of the 34th international ACM SIGIR conference, ACM. 2011, 155-164.

  25. Yin X, Huang JX, Li Z, Zhou X: A Survival Modeling Approach to Biomedical Search Result Diversification Using Wikipedia. IEEE Transactions on Knowledge and Data Engineering. 2013, 25 (6): 1201-1212.

  26. Liu Y, Huang JX, An A: Personalized recommendation with adaptive mixture of markov models. Journal of the American Society for Information Science and Technology. 2007, 58 (12): 1851-1870. 10.1002/asi.20631.

  27. Titterington D: Common structure of smoothing techniques in statistics. International Statistical Review/Revue Internationale de Statistique. 1985, 141-170.

  28. Zhai C, Lafferty J: A study of smoothing methods for language models applied to information retrieval. ACM Transactions on Information Systems. 2004, 22 (2): 179-214. 10.1145/984321.984322.

  29. Kononenko I: Machine learning for medical diagnosis: history, state of the art and perspective. Artificial Intelligence in Medicine. 2001, 23: 89-109. 10.1016/S0933-3657(01)00077-X.

  30. Donabedian A: Evaluating the quality of medical care. Milbank Quarterly. 2005, 83 (4): 691-10.1111/j.1468-0009.2005.00397.x.

  31. Cook N: Statistical evaluation of prognostic versus diagnostic models: beyond the ROC curve. Clinical Chemistry. 2008, 54: 17-

  32. Sanderson M: Information retrieval system evaluation: effort, sensitivity, and reliability. Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, ACM. 2005, 162-169.

  33. Kononenko I: Inductive and bayesian learning in medical diagnosis. Applied Artificial Intelligence. 1993, 7 (4): 317-337. 10.1080/08839519308949993.

  34. Zhang H, Su J: Naive bayesian classifiers for ranking. Machine Learning: ECML. 2004, 501-512.

  35. Chen S, Goodman J: An empirical study of smoothing techniques for language modeling. Computer Speech and Language. 1999, 13 (4): 359-394. 10.1006/csla.1999.0128.

  36. Alpha Global IT: [http://www.alpha-it.com/]


Acknowledgements

This research is supported in part by a research grant from the Natural Sciences and Engineering Research Council (NSERC) of Canada and by the Early Research Award/Premier's Research Excellence Award. The authors thank Dr. Joseph Kurian and Dr. William Melek from Alpha Global IT for their help and for providing the data. In particular, we thank the anonymous reviewers for their valuable and detailed comments on this paper.

Based on "A Bayesian-based prediction model for personalized medical health care", by Jiashu Zhao, Jimmy Xiangji Huang, Xiaohua Hu, C Joseph Kurian, and William Melek, which appeared in the 2012 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). ©2012 IEEE, 579-582.

Declarations

The publication costs for this article were funded by the corresponding author.

This article has been published as part of BMC Genomics Volume 14 Supplement S4, 2013: Selected articles from the IEEE International Conference on Bioinformatics and Biomedicine 2012: Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/14/S4.

Author information

Corresponding author

Correspondence to Jimmy Xiangji Huang.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

JZ proposed the BPLT+ model, carried out the experiments and drafted the manuscript. JXH supervised the project and revised the manuscript. JXH also contributed to the study design and experiments. XH provided useful feedback. All authors read and approved the final manuscript.

Rights and permissions

Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver ( https://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated.


About this article

Cite this article

Zhao, J., Huang, J.X. & Hu, X. BPLT+: A Bayesian-based personalized recommendation model for health care. BMC Genomics 14, S6 (2013). https://doi.org/10.1186/1471-2164-14-S4-S6


Keywords

  • Smoothing Parameter
  • Mean Average Precision
  • Bayesian Classifier
  • Ranking List
  • Smoothing Technique