Imputation and classification of time series with missing data using machine learning

Bibliographic Details
Main Author:	Dretvik, Vilde Fonn
Format:	Dissertation
Language:	eng
Subjects:	Andre helsefag: 829; Anvendt matematikk: 413; Applied mathematics: 413; Health sciences: 800; Helsefag: 800; Informasjons- og kommunikasjonsvitenskap: 420; Information and communication science: 420; Knowledge based systems: 425; Kunnskapsbaserte systemer: 425; Matematikk og Naturvitenskap: 400; Matematikk: 410; Mathematics and natural science: 400; Mathematics: 410; Medical disciplines: 700; Medisinske Fag: 700; Other health science disciplines: 829; Statistics: 412; Statistikk: 412; VDP

Description
Summary:	This work concerns the classification of time series with missing data, using imputation together with selected machine learning methods. Imputation was used to replace missing values in two data sets: one containing surgical site infection (SSI) data, covering 11 types of blood samples from patients over 20 days, and one called uwave, which contains 3D accelerometer data of several patterns made by a subset of people, of which two patterns were selected. The SSI data set is known to possess informative missingness. For the uwave data, missingness was simulated by removing data points in an informative (non-random) way. Dynamic Time Warping (DTW) and Euclidean distances were computed for each imputed data set to form distance grid matrices, which were then used to classify the data with the K Nearest Neighbour (KNN) and Support Vector Machine (SVM) classifiers. Furthermore, the features were augmented with masks that indicate the presence of missing data and with counters of consecutive spells of missing data, to help exploit informative missingness. The augmented data set was classified with the same classifiers and distance measures, and additionally with a newer classifier, the Temporal Convolution Network (TCN), which used the augmented data in combination with imputation of the original data. DTW was found to be unnecessary for the KNN classifier; Euclidean distance was sufficient. Augmenting the data improved the overall results for the SVM and KNN classifiers. The TCN was found to need more work, as its test results were unstable and much lower than the validation results would imply.
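The augmentation with masks and counters mentioned in the summary can be illustrated with a minimal sketch. This is not the author's implementation: the function name `augment_with_missingness`, the use of NaN to mark missing values, the zero fill standing in for the imputation step, and the reading of "counter" as the length of the current run of missing values are all assumptions made for illustration.

```python
import numpy as np


def augment_with_missingness(x):
    """Append a missingness mask and a run-length counter to a time series.

    x: array of shape (time_steps, features); NaN marks a missing value
       (an assumption of this sketch, not taken from the dissertation).
    Returns an array of shape (time_steps, 3 * features) laid out as
    [filled values | mask | counter].
    """
    x = np.asarray(x, dtype=float)

    # Mask: 1.0 where the value is missing, 0.0 where it was observed.
    mask = np.isnan(x).astype(float)

    # Counter: length of the current spell of consecutive missing values,
    # reset to 0 whenever a value is observed.
    counter = np.zeros_like(mask)
    for t in range(x.shape[0]):
        previous = counter[t - 1] if t > 0 else 0.0
        counter[t] = (previous + 1.0) * mask[t]

    # Placeholder imputation (assumption): replace missing values with 0.
    # Any other imputation method could be substituted here.
    filled = np.nan_to_num(x, nan=0.0)

    return np.concatenate([filled, mask, counter], axis=1)


# Usage: one feature observed over five time steps, three values missing.
series = np.array([[1.0], [np.nan], [np.nan], [4.0], [np.nan]])
augmented = augment_with_missingness(series)
print(augmented)
```

Each row of the result then carries the imputed value alongside explicit information about whether, and for how long, the value has been missing, which is the information a classifier would need to exploit informative missingness.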