
BPLT+: A Bayesian-based personalized recommendation model for health care

Abstract

In this paper, we propose an Advanced Bayesian-based Personalized Laboratory Tests recommendation (BPLT+) model. Given a patient, we estimate whether a new laboratory test should belong to a "taken" or "not-taken" class. We use the Bayesian method to build a weighting function for a laboratory test and the given patient. A higher weight indicates that the laboratory test has a higher probability of being "taken" and a lower probability of being "not-taken" by the patient. For effectiveness and robustness, we further integrate several modified smoothing techniques into the model. To evaluate the BPLT+ model objectively, we propose a framework in which the data set is randomly split into a training set, a validation input set and a validation label set. A training matrix is generated from the training set. Then, instead of accessing the training set repeatedly, we utilize this training matrix to predict laboratory tests on the validation input set. Finally, the recommended ranking list is compared with the validation label set using our proposed metric CorrectRateX. We conduct experiments on real medical data, and the experimental results show the effectiveness of the proposed BPLT+ model.

Background

Large amounts of clinic laboratory test data are collected and stored every day. Therefore, there is an increasing need for analyzing and utilizing the laboratory test data. The problem we are working on in this paper is to recommend laboratory tests for given patients. Health care recommendation problems have drawn researchers' attention for years. However, there are not a lot of studies conducted on the clinic laboratory test recommendation problem.

The medical data we are working on contains several years of patients' laboratory test records. Figure 1 shows an example of the data format. Formally, the laboratory test prediction problem can be described as follows [1]: "Given a set of patients P = {p_1, p_2, ..., p_n} and a set of laboratory tests T = {test_1, test_2, ..., test_M}, each patient p_j has done tests test_{j,1}, ..., test_{j,k_j}. If a doctor would like to assign a new test for patient p_j, which test in T should be chosen?"

Figure 1

An example dataset. The format of the laboratory data sets is presented: the attributes from left to right are SDTE (SERVICE DATE), REQ# (REQUISITION NUMBER), PNUM (PATIENT HEALTH CARD#), PNAM (PATIENT NAME), PSEX (PATIENT SEX), BDTE (PATIENT DATE OF BIRTH), TSEQ (TEST SEQUENCE NUMBER), TEST (TEST CODE), DESC (TEST DESCRIPTION), RSLT (TEST RESULT), NORM (NORMAL RANGE), REXP (RESULT EXPECTED Y/N), EXRS (EXTENDED RESULT Y/N). The patient information in this table is fictitious for privacy reasons.

Computer systems have played an important role in health care for years [2-8]. Statistical algorithms [9-12] play an important role in investigating health care data. The studies in [13, 14] extract chemical keywords from a query patent by analyzing word frequency and each word's effect over the data collection. Bayesian learning is a widely used approach that shows good performance [15-19]. A semantic-based association rule mining approach is proposed to model medical query contexts in [20]. Using a novel classifier based on the Bayesian discriminant function, Raymer et al. [21] present a hybrid algorithm that employs feature selection and extraction to isolate salient features from large medical and other biological data sets. Martín and Pérez [22] analyze the robustness of the optimal action in a Bayesian decision making problem in the context of health care. The studies in [23, 24] examine the association between two words by simulating the impact of words in documents in the context of information retrieval. A probabilistic survival model is derived from survival analysis theory for measuring the aspect novelty of genomics data [25]. A mixture Markov model is proposed to investigate user navigation patterns so that a personalized recommendation system can be built for each user [26]. In our previous work [1], we proposed a laboratory test prediction model that objectively determines whether a laboratory test is associated with a patient. This paper is a significant extension of [1].

Smoothing [27] is a technique to create an approximating function that attempts to capture important patterns in the data, while leaving out noise or other fine-scale structures and rapid phenomena. Smoothing techniques have been used in many realms to improve accuracy [28]. Based on the basic Bayesian algorithm and smoothing techniques, we propose an Advanced Bayesian-based Personalized Laboratory Tests recommendation (BPLT+) model to investigate the correlations among laboratory tests for each patient. Evaluation is a crucial issue in the health care domain [29]. Some previous health care researchers perform evaluation via patient interaction [30] or statistics [31]. We present a metric CorrectRateX by employing the idea of Mean Average Precision (MAP) [32] from the Information Retrieval domain.

Four unique contributions are presented in this paper. First, we learn the associations among laboratory tests and make personalized recommendations to patients without human interaction. Second, we integrate modified smoothing technologies to improve the personalized recommendation model and propose the BPLT+ model. Third, we propose a framework to randomly generate a training data set, a validation input set and a validation label set. Fourth, we use an objective evaluation metric for personalized recommendation systems without patient interaction.

Methods

Bayesian-Based personalized laboratory tests recommendation (BPLT) model

Here we assume that the laboratory tests for a patient have associations with each other. For instance, if a patient is suspected to have diabetes, the doctor will usually assign both a Hemoglobin test and a Glucose Fasting test for this patient. We can see that there exists an association between Hemoglobin and Glucose Fasting with respect to some hidden information, diabetes in this case. Conversely, if a patient is assigned a Hemoglobin test, then it is very likely that this patient should also take a Glucose Fasting test. In this section, we build a model for learning the associations among laboratory tests, inferring the associations between patients and laboratory tests, and thereby recommending new laboratory tests to patients. We regard the test recommendation problem as a special classification problem, where a test belongs to either a "taken" or a "not-taken" class. We use a Bayesian classifier as our basic classifier and modify it into a personalized ranking model.

Basic concept: Bayesian classifier

A classification problem is the following [33]: given a set of training instances, each described by a set of n attributes and each belonging to exactly one of a certain number of possible classes, learn to classify new, unseen objects. In addition, each attribute has a fixed number of possible values. We use the naive Bayesian classifier as our basic classifier in this paper, since it directly evaluates the probability of taking a test and the conditional probability between two tests. Moreover, naive Bayes is easy to construct and has surprisingly good performance in classification, even though the conditional independence assumption is rarely true in real-world applications [34]. The probability model for a classifier is the conditional model

\Pr(C \mid F_1, \ldots, F_n)
(1)

where F_1, ..., F_n are attributes, and C is a class variable. By Bayes' rule, this equals

\frac{\Pr(C) \Pr(F_1, \ldots, F_n \mid C)}{\Pr(F_1, \ldots, F_n)}
(2)

The denominator is effectively constant, and the numerator is equivalent to the joint probability model

\Pr(C, F_1, \ldots, F_n) = \Pr(C) \Pr(F_1 \mid C) \Pr(F_2 \mid C, F_1) \Pr(F_3 \mid C, F_1, F_2) \cdots \Pr(F_n \mid C, F_1, \ldots, F_{n-1})

The naive Bayesian model assumes that the features are conditionally independent:

\Pr(F_i \mid C, F_j) = \Pr(F_i \mid C), \quad \text{for } i \neq j

Therefore, the probability of class C given features F_1, ..., F_n is

\Pr(C \mid F_1, \ldots, F_n) = A \Pr(C) \prod_{i=1}^{n} \Pr(F_i \mid C)
(3)

where A = \frac{1}{\Pr(F_1, \ldots, F_n)} is a constant.
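As an illustration, the classification rule of Formula (3) can be sketched in a few lines of Python for binary features. The tiny training set, class names, and the eps guard against zero counts are illustrative assumptions, not part of the paper:

```python
# Minimal naive Bayes sketch for Formula (3) with binary features.
# The training rows and class names are illustrative, not from the paper.
train = [
    ((1, 0, 1), "taken"),
    ((1, 1, 1), "taken"),
    ((0, 0, 1), "not-taken"),
    ((0, 1, 0), "not-taken"),
]

def naive_bayes_scores(x, data, classes=("taken", "not-taken"), eps=1e-9):
    """Return the unnormalized score Pr(C) * prod_i Pr(F_i | C) per class."""
    scores = {}
    for c in classes:
        rows = [f for f, cl in data if cl == c]
        score = len(rows) / len(data)  # the prior Pr(C)
        for i, v in enumerate(x):
            # empirical Pr(F_i = v | C); eps is a crude guard against zero counts
            score *= (sum(1 for f in rows if f[i] == v) / len(rows)) or eps
        scores[c] = score
    return scores

scores = naive_bayes_scores((1, 0, 1), train)
predicted = max(scores, key=scores.get)
```

For x = (1, 0, 1), the "taken" score works out to 0.5 × 1.0 × 0.5 × 1.0 = 0.25, which dominates the "not-taken" score, so the predicted class is "taken".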

The weighting function of BPLT model

In this section, we describe the Bayesian-based Personalized Laboratory Tests recommendation (BPLT) model, which was proposed in our previous work [1]; more details are given in this paper. The purpose of the BPLT model is to classify laboratory tests for individual patients according to their personal conditions. In the real world, it is often easy to obtain patients' previous laboratory test information. Therefore, the BPLT model recommends additional new laboratory tests to patients, given the previous laboratory tests that the patients have taken.

Suppose we have a set of M laboratory tests T = {test_1, test_2, ..., test_M}, and a patient p_j who has taken tests T_j = {test_{j,1}, ..., test_{j,k_j}}, where test_{j,i} ∈ T for all 1 ≤ i ≤ k_j. We denote the events that the tests in T are taken by p_j as F_{j,1}, F_{j,2}, ..., F_{j,M}. For example, if we have 7 tests in T and p_j has taken test_3, test_5 and test_7, this could be represented as (F_{j,1}, F_{j,2}, ..., F_{j,7}) = (0, 0, 1, 0, 1, 0, 1). The Bayesian classifier is employed to evaluate the association between p_j and a new test test_0, where test_0 ∈ T and test_0 ∉ T_j. We use F_{j,0} to represent the event that p_j should take test_0, and F^c_{j,0} to represent the event that p_j should not take test_0. By Formula (3), the probability of F_{j,0} given F_{j,1}, F_{j,2}, ..., F_{j,M} is

\Pr(F_{j,0} \mid F_{j,1}, F_{j,2}, \ldots, F_{j,M}) \propto \Pr(F_{j,0}) \prod_{i=1}^{M} \Pr(F_{j,i} \mid F_{j,0})

The probability of F^c_{j,0} given F_{j,1}, F_{j,2}, ..., F_{j,M} is

\Pr(F^c_{j,0} \mid F_{j,1}, F_{j,2}, \ldots, F_{j,M}) \propto \Pr(F^c_{j,0}) \prod_{i=1}^{M} \Pr(F_{j,i} \mid F^c_{j,0})

In the BPLT model, we reward tests with a high probability of being "taken" and a low probability of being "not-taken". The correlation between a new test test_0 and a given patient p_j is shown in Definition 1 [1].

Definition 1 The correlation between a new test test_0 and a given patient p_j is defined as the log of the probability that p_j should take test_0 divided by the probability that p_j should not take test_0, given F_{j,1}, F_{j,2}, ..., F_{j,M}.

corr(test_0, p_j) = \log \frac{\Pr(F_{j,0} \mid F_{j,1}, F_{j,2}, \ldots, F_{j,M})}{\Pr(F^c_{j,0} \mid F_{j,1}, F_{j,2}, \ldots, F_{j,M})}
(4)

We can see that a higher value of corr(test_0, p_j) indicates that test_0 is more strongly associated with p_j. The calculation of corr(test_0, p_j) can be further simplified as follows

corr(test_0, p_j) = \log \Pr(F_{j,0} \mid F_{j,1}, \ldots, F_{j,M}) - \log \Pr(F^c_{j,0} \mid F_{j,1}, \ldots, F_{j,M})
= \log \Pr(F_{j,0}) \prod_{i=1}^{M} \Pr(F_{j,i} \mid F_{j,0}) - \log \Pr(F^c_{j,0}) \prod_{i=1}^{M} \Pr(F_{j,i} \mid F^c_{j,0})
= \log \frac{\Pr(F_{j,0})}{\Pr(F^c_{j,0})} + \sum_{i=1}^{M} \log \frac{\Pr(F_{j,i} \mid F_{j,0})}{\Pr(F_{j,i} \mid F^c_{j,0})}
(5)

Moreover, a test belongs to either the "taken" class or the "not-taken" class. Thus, the following two formulas hold.

\Pr(F_{j,0}) + \Pr(F^c_{j,0}) = 1
\Pr(F_{j,i} \mid F_{j,0}) \Pr(F_{j,0}) + \Pr(F_{j,i} \mid F^c_{j,0}) \Pr(F^c_{j,0}) = \Pr(F_{j,i})

from which we can obtain \Pr(F^c_{j,0}) and \Pr(F_{j,i} \mid F^c_{j,0}):

\Pr(F^c_{j,0}) = 1 - \Pr(F_{j,0})
\Pr(F_{j,i} \mid F^c_{j,0}) = \frac{\Pr(F_{j,i}) - \Pr(F_{j,i} \mid F_{j,0}) \Pr(F_{j,0})}{1 - \Pr(F_{j,0})}

Thus \Pr(F^c_{j,0}) and \Pr(F_{j,i} \mid F^c_{j,0}) in (5) can be eliminated from corr(test_0, p_j), as shown below

\log \frac{\Pr(F_{j,0})}{1 - \Pr(F_{j,0})} + \sum_{i=1}^{M} \log \frac{\Pr(F_{j,i} \mid F_{j,0}) \, (1 - \Pr(F_{j,0}))}{\Pr(F_{j,i}) - \Pr(F_{j,i} \mid F_{j,0}) \Pr(F_{j,0})}

The joint probability that patient p_j takes both test_i and test_0 is

\Pr(F_{j,i}, F_{j,0}) = \Pr(F_{j,i} \mid F_{j,0}) \Pr(F_{j,0})

The correlation between test_0 and p_j then becomes

corr(test_0, p_j) = \log \frac{\Pr(F_{j,0})}{1 - \Pr(F_{j,0})} + \sum_{i=1}^{M} \log \frac{\Pr(F_{j,i}, F_{j,0}) \, (1 - \Pr(F_{j,0}))}{\Pr(F_{j,0}) \, (\Pr(F_{j,i}) - \Pr(F_{j,i}, F_{j,0}))}
= (k - 1) \log \frac{1 - \Pr(F_{j,0})}{\Pr(F_{j,0})} + \sum_{i=1}^{M} \log \frac{\Pr(F_{j,i} \mid F_{j,0})}{\Pr(F_{j,i}) - \Pr(F_{j,i} \mid F_{j,0})}

which leads to the following Definition 2 [1].

Definition 2 The weighting function of a laboratory test test_0 for a patient p_j is the simplified correlation between test_0 and p_j

w(test_0, p_j) = (k - 1) \log \frac{1 - \alpha}{\alpha} + \sum_{i=1}^{M} \log \frac{\beta_{j,i}}{\gamma_{j,i} - \beta_{j,i}}
(6)

where

\alpha = \Pr(F_{j,0}) = \frac{\text{number of patients who have taken } test_0}{\text{number of patients}}
\gamma_{j,i} = \Pr(F_{j,i}) = \frac{\text{number of patients for whom } F_{j,i} \text{ holds}}{\text{number of patients}}
\beta_{j,i} = \Pr(F_{j,i} \mid F_{j,0}) = \frac{\Pr(F_{j,i}, F_{j,0})}{\Pr(F_{j,0})} = \frac{1}{\alpha} \cdot \frac{\text{number of patients for whom both } F_{j,0} \text{ and } F_{j,i} \text{ hold}}{\text{number of patients}}

The new laboratory tests are ranked in a list according to w(test_0, p_j) for a given patient p_j. In a later section, we present the evaluation environment for this laboratory test ranking list.
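A minimal sketch of computing the weight in Formula (6) from raw counts, under assumptions the paper leaves implicit: the sum runs over the tests the patient has already taken, k is the number of those tests, and the count dictionaries (counts, pair_counts) are hypothetical data structures introduced here for illustration:

```python
import math

def weight(test0, patient_tests, counts, pair_counts, n_patients):
    """Sketch of Formula (6): w = (k-1)*log((1-a)/a) + sum_i log(b/(g-b)).
    counts[t]: patients who took t; pair_counts[(s, t)]: patients who took both."""
    alpha = counts[test0] / n_patients                  # alpha = Pr(F_{j,0})
    k = len(patient_tests)                              # assumed meaning of k
    w = (k - 1) * math.log((1 - alpha) / alpha)
    for t in patient_tests:
        gamma = counts[t] / n_patients                  # gamma = Pr(F_{j,i})
        beta = pair_counts.get((test0, t), 0) / counts[test0]  # beta = Pr(F_{j,i}|F_{j,0})
        # log is undefined when beta = 0 or gamma = beta: the motivation for BPLT+
        w += math.log(beta / (gamma - beta))
    return w

# Toy counts: 10 patients, 4 took A, 6 took B, 2 took both.
w_AB = weight("A", ["B"], {"A": 4, "B": 6}, {("A", "B"): 2}, 10)
```

With these toy counts the first term vanishes (k = 1) and w_AB = log(0.5 / 0.1) = log 5.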

An advanced model: BPLT+

To obtain a more robust and better-performing model, we further propose an advanced model, BPLT+, which improves the BPLT model using several smoothing techniques. There are two reasons for smoothing BPLT. One is that smoothing is a way to deal with noise within the data. The other is to avoid mathematically undefined values. When the test test_0 has not been observed in previous visits, meaning α = 0, the first part of formula (6) is undefined. Likewise, when the joint frequency of two laboratory tests is zero, meaning β_{j,i} = 0, the second part of (6) is undefined. Therefore, we introduce smoothing technologies to further improve the BPLT model.

Smoothing techniques

In statistics, smoothing [27] is a technique to create an approximating function that attempts to capture important patterns in the data, while leaving out noise or other fine-scale structures/rapid phenomena. The main purpose of smoothing in this paper is to assign a non-zero probability to the unseen tests and improve the accuracy of test probability estimation in general.

The smoothing techniques are discussed based on the following definition of a conditional probability [28].

\Pr(t \mid p) = \frac{c(t; p)}{\sum_{t' \in T} c(t'; p)}
(7)

where c(t; p) is the count of patient p taking test t. Since we have defined a ranking problem, which is similar to problems in Information Retrieval (IR), we use several smoothing methods that are widely used in IR language models. The general form of a smoothed model [35] is assumed to be the following:

\Pr(t \mid p) = \begin{cases} \Pr_t(t \mid p) & \text{if test } t \text{ is observed} \\ \Pr(t \mid C) & \text{otherwise} \end{cases}
(8)

where Pr_t(t|p) is the smoothed probability of a test t given the patient's existing tests, and Pr(t|C) is the probability of a test t given the whole data set.

A smoothing method may be as simple as adding an extra count to every test, which is called additive or Laplace smoothing, or more sophisticated, as in Katz smoothing, where tests with different counts are treated differently. Three representative methods that are popular and effective are:

  • The Jelinek-Mercer method

    \Pr_\lambda(t \mid p) = (1 - \lambda) \Pr(t \mid p) + \lambda \Pr(t \mid C)
    (9)

where λ is a balancing parameter that ranges from 0 to 1.

  • Bayesian Smoothing using Dirichlet Priors

    \Pr_{\mu_0}(t \mid p) = \frac{c(t; p) + \mu_0 \Pr(t \mid C)}{\sum_{t' \in T} c(t'; p) + \mu_0}
    (10)

where µ0 > 0 is a balancing parameter. The Laplace method is a special case of this technique.

  • Absolute Discounting

    \Pr_\delta(t \mid p) = \frac{\max(c(t; p) - \delta, 0)}{\sum_{t' \in T} c(t'; p)} + \sigma \Pr(t \mid C)
    (11)

where δ ∈ [0, 1] is a discount constant and σ = δ|p|_u / |p|, so that all probabilities sum to one. Here |p|_u is the number of unique tests taken by patient p, and |p| is the patient's total test count.
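The three smoothing formulas (9)-(11) translate directly into code. This sketch takes the already-computed probabilities and counts as inputs; the function and argument names are ours, not from the paper:

```python
def jelinek_mercer(p_tp, p_tc, lam):
    """Formula (9): interpolate the patient model p(t|p) with the collection model p(t|C)."""
    return (1 - lam) * p_tp + lam * p_tc

def dirichlet_prior(count_tp, total_p, p_tc, mu0):
    """Formula (10): Bayesian smoothing with a Dirichlet prior of weight mu0."""
    return (count_tp + mu0 * p_tc) / (total_p + mu0)

def absolute_discount(count_tp, total_p, p_tc, delta, n_unique):
    """Formula (11): discount each seen count by delta; sigma = delta * |p|_u / |p|."""
    sigma = delta * n_unique / total_p
    return max(count_tp - delta, 0) / total_p + sigma * p_tc
```

For instance, with p(t|p) = 0.5, p(t|C) = 0.1 and λ = 0.2, Jelinek-Mercer gives 0.8 × 0.5 + 0.2 × 0.1 = 0.42.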

BPLT+ with smoothing techniques

There are two parts in formula (6) that need smoothing. The first is the conditional probability β_{j,i} = Pr(F_{j,i} | F_{j,0}). Its smoothed forms are as follows:

  • BPLT+ with Jelinek-Mercer

    \beta^\lambda_{j,i} = (1 - \lambda) \beta_{j,i} + \lambda \gamma_{j,i}
    (12)
  • BPLT+ with Dirichlet priors

    \beta^\mu_{j,i} = \frac{\beta_{j,i} + \mu \gamma_{j,i}}{1 + \mu}
    (13)
  • BPLT+ with absolute discounting

    \beta^\delta_{j,i} = \frac{\max(c(t; p) - \delta, 0)}{\sum_{t' \in T} c(t'; p)} + \delta \gamma_{j,i}
    (14)

In Jelinek-Mercer BPLT+ and Absolute Discounting BPLT+, we use the existing smoothing methods; the smoothing parameters λ and δ are within the range [0, 1]. In Dirichlet Priors BPLT+, we modify the Dirichlet smoothing technique by dividing both the numerator and the denominator in (10) by \sum_{t' \in T} c(t'; p), which normalizes the parameter µ to the range [0, 1], where \mu = \frac{\mu_0}{\sum_{t' \in T} c(t'; p)}.

The other part of formula (6) that needs smoothing is \log \frac{\alpha}{1 - \alpha}, a simple ratio that can be smoothed via Laplace smoothing as

\log \frac{\alpha + \theta}{1 - \alpha + \theta}
(15)

where θ is a tuning parameter that ranges from 0 to 1.
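Formulas (12), (13) and (15) are simple enough to sketch directly; note how the θ-smoothed term stays finite even when α = 0, which is exactly the case the unsmoothed log ratio cannot handle (function names are ours):

```python
import math

def beta_jelinek_mercer(beta, gamma, lam):
    """Formula (12): Jelinek-Mercer smoothed beta."""
    return (1 - lam) * beta + lam * gamma

def beta_dirichlet(beta, gamma, mu):
    """Formula (13): Dirichlet smoothing with the normalized parameter mu."""
    return (beta + mu * gamma) / (1 + mu)

def laplace_log_term(alpha, theta):
    """Formula (15): Laplace-smoothed log ratio; defined even for alpha = 0."""
    return math.log((alpha + theta) / (1 - alpha + theta))

# alpha = 0 no longer raises a math domain error: log(0.5 / 1.5) = log(1/3)
finite_at_zero = laplace_log_term(0.0, 0.5)
```
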

Evaluation environments

Datasets

The datasets in our experiment are obtained from Alpha Global IT [1, 36]. Alpha Corporate Group provides laboratory, medical clinic, commercial electronic medical record and practice management software. The data set contains 78 months of patients' laboratory test results. Our experiments use 6 months of results, containing 1,048,575 patient records, as a case study. Thousands of patients and more than 400 laboratory tests are included in our experiments. The data format is the same as the example shown in Figure 1. Our data set contains real patient information, such as health card ID, age, gender, date of visit, laboratory test ID and laboratory test results. We use only the patient ID and laboratory test ID attributes in this paper, and analyze the associations among the laboratory tests. In our future work, we will incorporate more attributes into the laboratory recommendation model.

Validation data and measure

To evaluate BPLT+ models objectively, we divide the data set into three components: a training set, a validation input set, and a validation label set. The data set is first randomly split into a training set and a validation set. In this step, we split based on patients and do not split the records of a single patient. Then, for the validation set, we randomly remove one test t* from each patient p_j and store t* in the validation label set. The ranked list returned by BPLT+ is compared with t* for each patient. To measure this comparison and thus evaluate the effectiveness of BPLT+, we use the following CorrectRateX [1]. Suppose the returned laboratory ranking list is L = {t_{1,j}, ..., t_{l,j}}; CorrectRateX validates whether t* appears among the top-ranked tests. The measure is adapted from the Mean Average Precision (MAP) [32] evaluation metric.
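The hold-one-out construction of the validation input and label sets can be sketched as follows; the dictionary layout and function name are illustrative assumptions, not from the paper:

```python
import random

def make_validation(patients, seed=0):
    """Hold out one random test t* per patient as the label (sketch).
    patients: dict patient_id -> set of test IDs (assumed layout)."""
    rng = random.Random(seed)
    val_input, val_label = {}, {}
    for pid, tests in patients.items():
        t_star = rng.choice(sorted(tests))      # the removed test t*
        val_input[pid] = set(tests) - {t_star}  # what the model sees
        val_label[pid] = t_star                 # what it must recover
    return val_input, val_label

val_input, val_label = make_validation(
    {"p1": {"test1", "test2", "test3"}, "p2": {"test4", "test5"}})
```

By construction, each patient's label test is disjoint from that patient's validation input, and together they reconstruct the original record.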

Definition 3 CorrectRateX evaluates the accuracy of a laboratory test prediction system. It is the number of patients whose desired (gold standard) test matches one of the top X tests generated by the system, divided by the total number of patients.

\mathrm{CorrectRate}_X = \frac{\sum_{j=1}^{n} \mathrm{TOP}_{j,X}}{n}
(16)

where

\mathrm{TOP}_{j,X} = \begin{cases} 1 & \text{if } t^* \text{ matches a test in } \{t_{1,j}, \ldots, t_{X,j}\} \\ 0 & \text{otherwise} \end{cases}

where n is the number of patients, and X is a parameter indicating how many top tests are compared with the gold standard test t*; X is set to 1 or 3 in this paper.

We present an example of how CorrectRateX evaluates the model in Table 1. Suppose the laboratory test set includes 200 tests and there are 5 patients in the validation set. As introduced above, the BPLT+ model returns a ranked list for each patient. Here ">" indicates that the weight of the left-side laboratory test is higher than the weight of the right-side laboratory test. In our example, 2 out of 5 patients have the desired test t* ranked in the top 1 position of the list, so CorrectRate1 equals 0.4, and 4 out of 5 patients have t* appearing within the top 3 positions of the returned ranking list, so CorrectRate3 equals 0.8. Since the top 3 positions include the top 1 position, the following statement always holds: CorrectRate1 ≤ CorrectRate3.

Table 1 An example of CorrectRateX
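The 2-out-of-5 and 4-out-of-5 example above can be reproduced with a short metric function; the ranked lists below are toy stand-ins for the system's output:

```python
def correct_rate(ranked_lists, labels, X):
    """Formula (16): fraction of patients whose held-out t* is in the top X."""
    hits = sum(1 for ranked, t_star in zip(ranked_lists, labels)
               if t_star in ranked[:X])
    return hits / len(labels)

# Five toy patients with t* = "t42"; hits at ranks 1, 2, 1, 3 and one miss.
ranked = [["t42", "t7", "t9"],
          ["t3", "t42", "t9"],
          ["t42", "t1", "t2"],
          ["t5", "t6", "t42"],
          ["t8", "t9", "t10"]]
labels = ["t42"] * 5
```

Here correct_rate(ranked, labels, 1) gives 0.4 and correct_rate(ranked, labels, 3) gives 0.8, matching the CorrectRate1 and CorrectRate3 values in the example.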

BPLT+ System Framework

The framework of the BPLT+ model is shown in Figure 2. The data set in this framework is abstracted to contain only patient IDs and laboratory test IDs. The procedures of the proposed framework are described as follows.

Figure 2

BPLT+ System Framework. The procedures for processing the laboratory data and testing the BPLT+ model are shown: (1) the rectangles represent the data sets; (2) the rounded rectangles present the implemented procedures; (3) the ovals show the personalized laboratory model; (4) the lines with arrows determine the directions through the framework.

Split: First the data set is randomly split into a training set and a validation set.

Randomly remove a test as label: Since it is hard to objectively evaluate the performance of the BPLT+ model, we further randomly remove one test for each patient visit in the validation set. These removed tests are regarded as the labels of the validation input set. Our ultimate goal is to recommend the missing test for a patient's visit.

Build training matrix: To avoid repeatedly calculating the frequency of a test and the joint frequency of two tests, we build a training matrix from the training data. This training matrix contains the frequencies of co-occurrence of pairs of laboratory tests. For example, if a patient in the training data did test_1 and test_2 together, we add 1 to both F_{12} and F_{21}. We can see that the training matrix is a symmetric matrix.
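The symmetric co-occurrence matrix can be built in one pass over the training records; a sparse dictionary stands in for the matrix here, and the data layout is an assumption:

```python
from collections import defaultdict
from itertools import combinations

def build_training_matrix(patient_tests):
    """Count co-occurrences of test pairs; F[(s, t)] == F[(t, s)] (symmetric).
    patient_tests: dict patient_id -> set of test IDs (assumed layout)."""
    F = defaultdict(int)
    for tests in patient_tests.values():
        for s, t in combinations(sorted(tests), 2):
            F[(s, t)] += 1
            F[(t, s)] += 1  # mirror the count to keep the matrix symmetric
    return F

F = build_training_matrix({"p1": {"test1", "test2"},
                           "p2": {"test1", "test2", "test3"}})
```

In this toy run, test1 and test2 co-occur for both patients, so F[("test1", "test2")] and its mirror both equal 2.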

BPLT+ model: The correlation between a given test_0 and a patient is calculated based on formula (6).

Evaluation via CorrectRateX: Finally, the evaluation criterion CorrectRateX determines whether the model made the correct recommendations.

Results

We first show the overall performance under different training-validation proportions in Table 2 [1]. We randomly take 40%, 50% and 60% of the raw data set as training data and keep the rest as validation data. In general, the BPLT+ model performs better on a larger training data set. This is because a larger training data set contains more information, so more knowledge can be learned. With the development of computer technology, larger amounts of medical data will become available in practice. Therefore, we use 60% of the data as training data in the rest of this paper. As discussed above, CorrectRate3 is always at least as high as CorrectRate1. In general, the BPLT+ model has promising performance, with an accuracy of 0.7074 for CorrectRate1 and an accuracy of 0.7840 for CorrectRate3.

Table 2 Performance

Then we investigate in detail how the smoothing parameters affect effectiveness. We first consider smoothing β_{j,i} only. Three smoothing technologies are utilized to smooth β_{j,i}: Jelinek-Mercer BPLT+, Dirichlet Priors BPLT+ and Absolute Discounting BPLT+, with the corresponding parameters λ, µ, δ ∈ [0, 1]. We conduct experiments on these three methods individually. The changes of CorrectRate1 and CorrectRate3 with respect to the parameters are shown in Figure 3, Figure 4, and Figure 5. We can see from the figures that the curve of CorrectRate1 is always below the curve of CorrectRate3, which is consistent with Definition 3. As the parameters increase from 0.1 to 1, both CorrectRate1 and CorrectRate3 rise at first, owing to the incorporation of the smoothing portion. After reaching their maximum values, CorrectRate1 and CorrectRate3 decline, since the weighting tends to become more uniform when too much smoothing is incorporated. All the smoothing parameters achieve their best performance at the value 0.2. Comparing the three methods, Jelinek-Mercer BPLT+ obtains the best performance on both CorrectRate1 and CorrectRate3, at 0.5569 and 0.6167 respectively. In terms of average values, Dirichlet Priors BPLT+'s average performance on CorrectRate3 is better than the other two, and Jelinek-Mercer BPLT+'s average performance on CorrectRate1 is the best.

Figure 3

Parameter Sensitivity of λ in Jelinek-Mercer BPLT+. The influence of parameter λ is investigated: (1) the stars represent the performance of Jelinek-Mercer BPLT+ under the evaluation metric CorrectRate1; (2) the circles represent the performance of Jelinek-Mercer BPLT+ under the evaluation metric CorrectRate3; (3) CorrectRate3 is always higher than CorrectRate1; (4) Jelinek-Mercer BPLT+ achieves its best performance when λ = 0.2.

Figure 4

Parameter Sensitivity of µ in Dirichlet Priors BPLT+. The influence of parameter µ is studied: (1) the stars represent the performance of Dirichlet Priors BPLT+ under the evaluation metric CorrectRate1; (2) the circles represent the performance of Dirichlet Priors BPLT+ under the evaluation metric CorrectRate3; (3) Dirichlet Priors BPLT+ achieves its best performance when µ = 0.2.

Figure 5

Parameter Sensitivity of δ in Absolute Discounting BPLT+. The influence of parameter δ is investigated: (1) the stars represent the performance of Absolute Discounting BPLT+ under the evaluation metric CorrectRate1; (2) the circles represent the performance of Absolute Discounting BPLT+ under the evaluation metric CorrectRate3; (3) Absolute Discounting BPLT+ achieves its best performance when δ = 0.2.

We further discuss smoothing the other part of (6), where the Laplace smoothing parameter is θ. As discussed above, Jelinek-Mercer BPLT+ has the best performance on both CorrectRate1 and CorrectRate3. We therefore investigate the sensitivity of θ while fixing Jelinek-Mercer BPLT+ with λ = 0.2. The results are shown in Figure 6. We can see that CorrectRate1 increases as θ increases, while CorrectRate3 decreases slightly and then increases. Both reach their maximum and tend to be stable when θ is greater than 0.5.

Figure 6

Parameter Sensitivity of θ. The influence of parameter θ is presented: (1) we use the best smoothing technique for the first part in Formula 6, which is Jelinek-Mercer BPLT+; (2) the smoothing parameter λ is set to its optimal value; (3) the stars represent the results of Jelinek-Mercer BPLT+ under evaluation metric CorrectRate1; (4) the circles represent the results of Jelinek-Mercer BPLT+ under evaluation metric CorrectRate3; (5) both metrics reach their maximum and tend to be stable when θ is greater than 0.5.

Conclusions and future work

An Advanced Bayesian-based Personalized Laboratory Tests recommendation (BPLT+) model is proposed in this paper. Based on the assumption that hidden associations can exist among laboratory tests, we employ a Bayesian approach to build a weighting function for scoring the correlation between a new laboratory test and a patient. To obtain a more robust and better-performing model, we incorporate several enhanced smoothing technologies into the BPLT+ model. The main purpose of smoothing in this paper is to assign a non-zero probability to unseen laboratory tests and to improve the accuracy of test probability estimation. We integrate existing smoothing techniques into the BPLT+ model. In particular, we use three techniques, the Jelinek-Mercer, Dirichlet Priors and Absolute Discounting approaches, to smooth the conditional probability of observing a patient taking an existing test when a new test test_0 is given (Formulas 12-14). We also use the Laplace method to smooth the log function in the BPLT+ model (Formula 15). We conducted experiments to examine the performance of the BPLT+ model and the sensitivity of the smoothing parameters. We find that BPLT+ is able to make accurate recommendations under proper smoothing parameters.

Further, we propose a novel framework, shown in Figure 2, for effectively implementing the BPLT+ model and objectively testing personalized recommendation systems without human interaction. Based on real patients' laboratory test data, we randomly generate a training data set, a validation input set and a validation label set. A training matrix containing the laboratory test statistics is calculated from the training data set and stored. For new patients (the validation input set), instead of processing the original training set, we utilize this training matrix to predict laboratory tests on the validation input set, and compare the ranking results with the validation label set.

There are a few future directions for this research. As we can see from the data format in Figure 1, we have not made use of all the attributes. In the future, we would like to conduct a comprehensive investigation of patients' profiles. For example, we can cluster the patients into groups and investigate the similarities of patients in the same group. We can also analyze the associations among laboratory test results and thereby further enhance our proposed personalized recommendation model. Moreover, we look forward to testing our proposed models in more real applications.

References

  1. 1.

    Zhao J, Huang JX, Hu X, Kurian J, Melek W: A Bayesian-based prediction model for personalized medical health care. Bioinformatics and Biomedicine (BIBM). 2012, 1-4. 10.1109/BIBM.2012.6392623. IEEE International Conference on: 4-7 October 2012

    Google Scholar 

  2. 2.

    Bates D, Cohen M, Leape L, Overhage J, Shabot M, Sheridan T: Reducing the frequency of errors in medicine using information technology. Journal of the American Medical Informatics Association. 2001, 8 (4): 299-308. 10.1136/jamia.2001.0080299.

    PubMed Central  CAS  Article  PubMed  Google Scholar 

  3. 3.

    Ogiela L, Tadeusiewicz R, Ogiela M: Cognitive techniques in medical information systems. Computers in Biology and Medicine. 2008, 38 (4): 501-507. 10.1016/j.compbiomed.2008.01.017.

    Article  PubMed  Google Scholar 

  4. 4.

    Shortliffe E, Cimino J: Biomedical informatics: computer applications in health care and biomedicine. 2006, Springer

    Book  Google Scholar 

  5. 5.

    Melski J, Geer D, Bleich H: Medical information storage and retrieval using preprocessed variables. Computers and Biomedical Research, An International Journal. 1978, 11 (6): 613-10.1016/0010-4809(78)90038-1.

    CAS  Article  PubMed  Google Scholar 

  6. 6.

    Thoma G, Suthasinekul S, Walker F, Cookson J, Rashidian M: A prototype system for the electronic storage and retrieval of document images. ACM Transactions on Information Systems. 1985, 3 (3): 279-291. 10.1145/4229.4232.

    Article  Google Scholar 

  7. 7.

    Frick S, Uehlinger D, Zenklusen R: Medical futility: Predicting outcome of intensive care unit patients by nurses and doctors-A prospective comparative study*. Critical Care Medicine. 2003, 31 (2): 456-461. 10.1097/01.CCM.0000049945.69373.7C.

    Article  PubMed  Google Scholar 

  8. 8.

    Wu W, Bui A, Batalin M, Au L, Binney J, Kaiser W: MEDIC: Medical embedded device for individualized care. Artificial Intelligence in Medicine. 2008, 42 (2): 137-152. 10.1016/j.artmed.2007.11.006.

    Article  PubMed  Google Scholar 

  9. 9.

    Kajíc V, Esmaeelpour M, Považay B, Marshall D, Rosin P, Drexler W: Automated choroidal segmentation of 1060 nm OCT in healthy and pathologic eyes using a statistical model. Biomedical Optics Express. 2012, 3 (1): 86-103. 10.1364/BOE.3.000086.

    PubMed Central  Article  PubMed  Google Scholar 

  10. Kokol P, Pohorec S, Štiglic G, Podgorelec V: Evolutionary design of decision trees for medical application. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery. 2012, 2 (3): 237-254. 10.1002/widm.1056.

  11. Pepe M: The statistical evaluation of medical tests for classification and prediction. 2004, Oxford University Press, USA.

  12. Rohian H, An A, Zhao J, Huang X: Discovering temporal associations among significant changes in gene expression. Proceedings of IEEE International Conference on Bioinformatics and Biomedicine, IEEE. 2009, 419-423.

  13. Lupu M, Huang XJ, Zhu J: TREC Chemical Information Retrieval - An Initial Evaluation Effort for Chemical IR Systems. World Patent Information Journal. 2011, 33 (3): 248-256. 10.1016/j.wpi.2011.03.002.

  14. Zhao J, Huang X, Ye Z, Zhu J: York University at TREC 2009: Chemical Track. Proceedings of the 18th Text REtrieval Conference. 2009.

  15. Bernardo J, Smith A: Bayesian theory. Measurement Science and Technology. 2001, 12: 221-222.

  16. Chen J, Huang H, Tian F, Tian S: A selective bayes classifier for classifying incomplete data based on gain ratio. Knowledge-Based Systems. 2008, 21 (7): 530-534. 10.1016/j.knosys.2008.03.013.

  17. Clèries R, Ribes J, Buxo M, Ameijide A, Marcos-Gragera R, Galceran J, Martínez J, Yasui Y: Bayesian approach to predicting cancer incidence for an area without cancer registration by using cancer incidence data from nearby areas. Statistics in Medicine. 2012.

  18. Huang X, Hu Q: A Bayesian Learning Approach to Promoting Diversity in Ranking for Biomedical Information Retrieval. Proceedings of the 32nd Annual International Conference on Research and Development in Information Retrieval. 2009, 19-23.

  19. Liechty J, Liechty M, Muller P: Bayesian correlation estimation. Biometrika. 2004, 91: 1-10.1093/biomet/91.1.1.

  20. Babashzadeh A, Daoud M, Huang J: Using semantic-based association rule mining for improving clinical text retrieval. Health Information Science. 2013, 186-197.

  21. Raymer M, Doom T, Kuhn L, Punch W: Knowledge discovery in medical and biological datasets using a hybrid bayes classifier/evolutionary algorithm. IEEE Transactions on Systems, Man, and Cybernetics, Part B. 2003, 33 (5): 802-813. 10.1109/TSMCB.2003.816922.

  22. Martín J, Pérez C, Muller P: Bayesian robustness for decision making problems: Applications in medical contexts. International Journal of Approximate Reasoning. 2009, 50 (2): 315-323. 10.1016/j.ijar.2008.03.017.

  23. Hu Q, Huang X: Passage Extraction and Result Combination for Genomics Information Retrieval. Journal of Intelligent Information Systems. 2010, 34 (3): 249-274. 10.1007/s10844-009-0097-4.

  24. Zhao J, Huang JX, He B: CRTER: using cross terms to enhance probabilistic information retrieval. Proceedings of the 34th international ACM SIGIR conference, ACM. 2011, 155-164.

  25. Yin X, Huang JX, Li Z, Zhou X: A Survival Modeling Approach to Biomedical Search Result Diversification Using Wikipedia. IEEE Transactions on Knowledge and Data Engineering. 2013, 25 (6): 1201-1212.

  26. Liu Y, Huang JX, An A: Personalized recommendation with adaptive mixture of markov models. Journal of the American Society for Information Science and Technology. 2007, 58 (12): 1851-1870. 10.1002/asi.20631.

  27. Titterington D: Common structure of smoothing techniques in statistics. International Statistical Review/Revue Internationale de Statistique. 1985, 141-170.

  28. Zhai C, Lafferty J: A study of smoothing methods for language models applied to information retrieval. ACM Transactions on Information Systems. 2004, 22 (2): 179-214. 10.1145/984321.984322.

  29. Kononenko I: Machine learning for medical diagnosis: history, state of the art and perspective. Artificial Intelligence in Medicine. 2001, 23: 89-109. 10.1016/S0933-3657(01)00077-X.

  30. Donabedian A: Evaluating the quality of medical care. Milbank Quarterly. 2005, 83 (4): 691-10.1111/j.1468-0009.2005.00397.x.

  31. Cook N: Statistical evaluation of prognostic versus diagnostic models: beyond the ROC curve. Clinical Chemistry. 2008, 54: 17-

  32. Sanderson M: Information retrieval system evaluation: effort, sensitivity, and reliability. Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, ACM. 2005, 162-169.

  33. Kononenko I: Inductive and bayesian learning in medical diagnosis. Applied Artificial Intelligence. 1993, 7 (4): 317-337. 10.1080/08839519308949993.

  34. Zhang H, Su J: Naive bayesian classifiers for ranking. Machine Learning: ECML. 2004, 501-512.

  35. Chen S, Goodman J: An empirical study of smoothing techniques for language modeling. Computer Speech and Language. 1999, 13 (4): 359-394. 10.1006/csla.1999.0128.

  36. Alpha Global IT: [http://www.alpha-it.com/]


Acknowledgements

This research is supported in part by a research grant from the Natural Sciences and Engineering Research Council (NSERC) of Canada and by the Early Research Award/Premier's Research Excellence Award. The authors thank Dr. Joseph Kurian and Dr. William Melek from Alpha Global IT for their help and for providing the data. In particular, we thank the anonymous reviewers for their valuable and detailed comments on this paper.

Based on "A Bayesian-based prediction model for personalized medical health care", by Jiashu Zhao, Jimmy Xiangji Huang, Xiaohua Hu, C Joseph Kurian, and William Melek, which appeared in the 2012 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). ©2012 IEEE, 579-582.

Declarations

The publication costs for this article were funded by the corresponding author.

This article has been published as part of BMC Genomics Volume 14 Supplement S4, 2013: Selected articles from the IEEE International Conference on Bioinformatics and Biomedicine 2012: Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/14/S4.

Author information

Corresponding author

Correspondence to Jimmy Xiangji Huang.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

JZ proposed the BPLT+ model, carried out the experiments and drafted the manuscript. JXH supervised the project and revised the manuscript. JXH also contributed to the study design and experiments. XH provided useful feedback. All authors read and approved the final manuscript.

Rights and permissions

Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver ( https://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated.


About this article

Cite this article

Zhao, J., Huang, J.X. & Hu, X. BPLT+: A Bayesian-based personalized recommendation model for health care. BMC Genomics 14, S6 (2013). https://doi.org/10.1186/1471-2164-14-S4-S6


Keywords

  • Smoothing Parameter
  • Mean Average Precision
  • Bayesian Classifier
  • Ranking List
  • Smoothing Technique