Neural Imputations vs. Mean Imputations for Supervised Learning with Missing Data

Series

Research Master Defense
Location

Online

Date and time

May 14, 2024
14:00 - 15:30

In supervised learning, missing values pose a serious problem because most supervised learning algorithms cannot handle missing data natively. There is a rich literature on missing value imputation for inference, that is, the task of imputing missing values as accurately as possible. Only recently, however, have theoretical results been established for imputing missing values such that performance on the downstream supervised learning task is maximized. These results suggest that there are benefits in developing new supervised learning algorithms that jointly learn the imputation function and the prediction function. In theory, this should ensure that both functions are smooth and continuous, and thus easily learnable from a finite sample of training points. We put these assertions to the test in a computational study on 10 real datasets, comparing two deep neural networks that handle missing data natively against impute-then-regress procedures, which first impute the missing values and then train a predictive model on the imputed dataset. We find that the recent theoretical results do not hold up in practice: mean imputation followed by training a predictive model performs as well as the newly developed deep imputation networks. This suggests that practitioners are better off sticking to simple mean imputation rather than investing time and resources in more complex deep imputation models.
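
For readers unfamiliar with the impute-then-regress baseline described above, the following is a minimal sketch in plain Python, not the code used in the study. The function names (fit_column_means, impute, fit_linear, predict), the toy data, and the choice of a linear model fitted by gradient descent are all assumptions made for illustration; the abstract does not specify which predictive model was used.

```python
import math

NAN = float("nan")


def fit_column_means(rows):
    """Per-column means over the observed (non-NaN) entries of the training rows."""
    means = []
    for j in range(len(rows[0])):
        observed = [r[j] for r in rows if not math.isnan(r[j])]
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return means


def impute(rows, means):
    """Replace each missing entry with the training mean of its column."""
    return [[means[j] if math.isnan(v) else v for j, v in enumerate(r)] for r in rows]


def fit_linear(rows, targets, lr=0.01, epochs=2000):
    """Fit weights and a bias by gradient descent on the mean squared error."""
    n, d = len(rows), len(rows[0])
    weights, bias = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(rows, targets):
            err = sum(w * v for w, v in zip(weights, x)) + bias - y
            for j in range(d):
                grad_w[j] += 2 * err * x[j] / n
            grad_b += 2 * err / n
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias


def predict(rows, weights, bias):
    """Linear predictions for already-imputed rows."""
    return [sum(w * v for w, v in zip(weights, x)) + bias for x in rows]


# Hypothetical toy data: two features, with some entries missing.
train_X = [[1.0, 2.0], [2.0, NAN], [NAN, 1.0], [3.0, 3.0]]
train_y = [4.0, 5.0, 5.0, 9.0]

# Step 1: learn the imputation (column means) on the training data only.
means = fit_column_means(train_X)
# Step 2: train the predictive model on the imputed training data.
weights, bias = fit_linear(impute(train_X, means), train_y)

# At prediction time, reuse the training means to fill in missing test entries.
test_X = [[NAN, 2.0]]
predictions = predict(impute(test_X, means), weights, bias)
print(predictions)
```

The point of the sketch is that the two steps are learned separately: the imputation never sees the targets. The natively missing-data-aware networks discussed in the abstract instead learn both functions jointly.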