Smooth Bayesian network model for the prediction of future high-cost patients with COPD

Research output: Journal Publications and Reviews; RGC 21 - Publication in refereed journal; peer-review

22 Scopus Citations


  • Qingpeng Zhang
  • Li Luo
  • Lei Chen
  • Wei Zhang


Original language: English
Pages (from-to): 147-155
Journal / Publication: International Journal of Medical Informatics
Online published: 4 Apr 2019
Publication status: Published - Jun 2019


Introduction: The clinical course of chronic obstructive pulmonary disease (COPD) is marked by acute exacerbation events that increase hospitalization rates and healthcare spending. The early identification of future high-cost patients with COPD may decrease healthcare spending by informing individualized interventions that prevent exacerbation events and slow disease progression. Existing studies of cost prediction for other chronic diseases have applied regression and machine-learning methods that cannot capture the complex causal relationships between COPD factors. Thus, there is a strong need to explore these factors through nonlinear, high-dimensional but explainable modeling.
Objectives: We aimed to develop a machine-learning model to identify future high-cost patients with COPD. Such a model should incorporate expert knowledge about causal relationships, and its estimation method should provide more accurate predictions than other machine-learning methods.
Methods: We used the 2011–2013 medical insurance data of patients with COPD in a large city. The data set included demographic information and admission records. Building on developments in graphical modeling methods, we proposed a smooth Bayesian network (SBN) model for the prediction of high-cost individuals using medical insurance data. The modeling method incorporated some expert knowledge about causal relationships (i.e., about the Bayesian network structure). To address overfitting, the case-mix effect, and data sparsity, we employed a smoothing kernel based on the weighted nearest-neighbor method in the SBN model, so that estimates also draw on data about “similar patients”.
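The smoothing idea in the Methods can be illustrated with a minimal sketch. It is not the authors' SBN estimator: the Hamming distance, the exponential kernel, the `k` and `bandwidth` parameters, and the toy records are all assumptions made for illustration. The sketch estimates the probability of a high-cost outcome for one configuration of parent variables by weighting the nearest, most similar records, which still gives an estimate when the exact configuration never appears in the data.

```python
import math

def hamming(a, b):
    """Number of positions where two equal-length tuples differ."""
    return sum(x != y for x, y in zip(a, b))

def smoothed_prob(records, query, k=3, bandwidth=1.0):
    """Estimate P(high_cost = 1 | parents = query) from the k nearest records.

    records: non-empty list of (parent_values, high_cost_label) pairs.
    Each neighbor is weighted by exp(-distance / bandwidth) (assumed kernel).
    """
    ranked = sorted(records, key=lambda r: hamming(r[0], query))[:k]
    weights = [math.exp(-hamming(p, query) / bandwidth) for p, _ in ranked]
    return sum(w * y for w, (_, y) in zip(weights, ranked)) / sum(weights)

# Hypothetical toy data: (region, hospital tier) -> high-cost label.
records = [
    (("urban", "primary"), 1),
    (("urban", "primary"), 1),
    (("urban", "tertiary"), 0),
    (("suburban", "tertiary"), 0),
]

# ("suburban", "primary") never occurs, yet similar patients give an estimate.
p_unseen = smoothed_prob(records, ("suburban", "primary"))
p_seen = smoothed_prob(records, ("urban", "primary"))
```

In this toy setting, an unsmoothed conditional probability table would have no data at all for the unseen configuration, which is the sparsity problem that smoothing is meant to ease.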
Results: The proposed SBN achieved an area under the curve (AUC) of 0.80 and showed considerable improvement over the baseline machine-learning methods. Besides confirming the known factors from the literature, we found “region” (i.e., a suburban or urban area) to be a significant factor. We also found that, in a 3-tier system with primary, secondary and tertiary hospitals, COPD patients who had been admitted to primary hospitals were more likely to become future high-cost patients than patients who had been admitted to tertiary hospitals.
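For readers unfamiliar with the metric reported in the Results, the AUC can be read as the probability that a randomly chosen high-cost patient receives a higher predicted score than a randomly chosen non-high-cost patient. The sketch below computes it by direct pairwise comparison; the labels and scores are made-up values, not data from the study.

```python
def auc(labels, scores):
    """Pairwise AUC: share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    Requires at least one positive and one negative label.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical example: 1 = high-cost patient, 0 = not high-cost.
example_auc = auc([1, 0, 1, 0], [0.9, 0.4, 0.35, 0.2])
```

Three of the four positive/negative pairs are ordered correctly in this example, so the value is 0.75; a value of 0.5 corresponds to random ranking and 1.0 to perfect ranking.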
Conclusion: The proposed SBN model not only obtained higher prediction accuracy and stronger generalizability than a number of benchmark machine-learning methods, but also used the Bayesian network to capture the complex causal relationships between different predictors by incorporating expert knowledge. Furthermore, a framework was developed to establish the relationship between exposure to a historical trajectory and a future outcome; this framework can also be applied to other temporal data to model different trajectory information and predict other outcomes.

Research Area(s)

  • Health informatics, COPD, Bayesian network, Machine learning, Graphical representation, Cost prediction, Temporal data, Data sparsity, Complex causal relationships, Smoothing