Predicting Missing Values in Medical Data Via XGBoost Regression

The data in a patient’s laboratory test result is a notable resource to support clinical investigation and enhance medical research. However, for a variety of reasons, this type of data often contains a non-trivial number of missing values. For example, physicians may neglect to order tests or docum...

Full description

Saved in:

Bibliographic Details
Published in:	Journal of healthcare informatics research 2020-12, Vol.4 (4), p.383-394
Main Authors:	Zhang, Xinmeng, Yan, Chao, Gao, Cheng, Malin, Bradley A., Chen, You
Format:	Article
Language:	eng
Subjects:	Biomedical Engineering and Bioengineering Computational Biology/Bioinformatics Computational Intelligence Computer Science Data Mining and Knowledge Discovery Health Informatics Research Article
Online Access:	Get full text
Tags:	Add Tag No Tags, Be the first to tag this record!

Description
Summary:	The data in a patient’s laboratory test result is a notable resource to support clinical investigation and enhance medical research. However, for a variety of reasons, this type of data often contains a non-trivial number of missing values. For example, physicians may neglect to order tests or document the results. Such a phenomenon reduces the degree to which this data can be utilized to learn efficient and effective predictive models. To address this problem, various approaches have been developed to impute missing laboratory values; however, their performance has been limited. This is due, in part, to the fact no approaches effectively leverage the contextual information (1) in individual or (2) between laboratory test variables. We introduce an approach to combine an unsupervised prefilling strategy with a supervised machine learning approach, in the form of extreme gradient boosting (XGBoost), to leverage both types of context for imputation purposes. We evaluated the methodology through a series of experiments on approximately 8200 patients’ records in the MIMIC-III dataset. The results demonstrate that the new model outperforms baseline and state-of-the-art models on 13 commonly collected laboratory test variables. In terms of the normalized root mean square derivation (nRMSD), our model exhibits an imputation improvement by over 20%, on average. Missing data imputation on the temporal variables can be largely improved via prefilling strategy and the supervised training technique, which leverages both the longitudinal and cross-sectional context simultaneously.
ISSN:	2509-4971 2509-498X