Evaluation Criteria for Missing Data Imputation Techniques

Hello,
I have 5 methods for imputing missing data, because my original data set is industrial data and contains missing values. To perform a PCA and obtain positive eigenvalues, I need the covariance matrix to be positive definite.
I applied the 5 methods to impute the missing data, so now I have 5 new X_imputed matrices.
Question: How can I measure the performance of each one? What criteria should I use?
I read about calculating the RMSE, but the formula takes the square root of the mean squared difference between Xi observed and Xi imputed. Those authors can do that calculation because their initial X is complete and they introduce a given % of missing data (MD) themselves. My problem is that I already start with missing data.

Answers (2)

Jeff Miller on 4 Jul 2018
You can't evaluate the performance of the different imputation methods with respect to your actual data set, for exactly the reason you mention. You can only compare their performance across simulations where you know the values of each of the missing points (i.e., your simulation pretends that some simulated points are missing). Such a simulation would require very detailed assumptions about the multivariate situation that your data came from, including the reasons why some points are missing.
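A minimal sketch of such a simulation, under loudly stated assumptions: the data model (Gaussian data with a made-up mixing matrix A), the 30% rate, the completely random missingness pattern, and the column-mean imputation are all placeholders for illustration, not anything taken from the real data set. Each of the 5 imputation methods would be swapped in where the column-mean step is.

```matlab
% Sketch: score an imputation method on simulated data where the truth is known.
rng(0);                                   % reproducible random numbers
n = 200; p = 4;
A = [1 0.6 0.3 0; 0 1 0.5 0.2; 0 0 1 0.4; 0 0 0 1];  % hypothetical mixing matrix
Xtrue = randn(n, p) * A;                  % complete simulated data (assumed model)

missFrac = 0.30;                          % assumed fraction of missing entries
isMissing = rand(n, p) < missFrac;        % assumed: missing completely at random
Xobs = Xtrue;
Xobs(isMissing) = NaN;

% Placeholder imputation method: replace each NaN with its column mean.
colMeans = mean(Xobs, 1, 'omitnan');
Ximp = Xobs;
for j = 1:p
    Ximp(isMissing(:, j), j) = colMeans(j);
end

% RMSE over the entries that were deliberately removed.
rmse = sqrt(mean((Xtrue(isMissing) - Ximp(isMissing)).^2));
fprintf('RMSE on removed entries: %.4f\n', rmse);
```

Repeating this over many random draws, and over missingness mechanisms closer to the real process, would give a fairer ranking of the methods than a single run.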
It might be better to perform the PCA without imputing any missing data (check the pca documentation). Did you try the following?
coeff = pca(X,'Rows','pairwise');
This essentially computes each entry in the covariance matrix using whichever of your original data rows/cases have values for both relevant variables.
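To illustrate the idea, here is a rough sketch that builds a pairwise-complete covariance matrix by hand and checks its smallest eigenvalue. The data are random placeholders, and the loop is not necessarily identical to what pca does internally. One thing worth knowing is that a covariance matrix assembled this way is not guaranteed to be positive semidefinite, which matters if positive eigenvalues are required.

```matlab
% Sketch: pairwise-complete covariance and a check of its eigenvalues.
rng(1);
X = randn(50, 3);                         % hypothetical data
X(rand(size(X)) < 0.3) = NaN;             % roughly 30% missing entries

p = size(X, 2);
C = zeros(p);
for i = 1:p
    for j = i:p
        both = ~isnan(X(:, i)) & ~isnan(X(:, j));  % rows observed in both columns
        c = cov(X(both, i), X(both, j));           % 2-by-2 covariance of the pair
        C(i, j) = c(1, 2);
        C(j, i) = c(1, 2);
    end
end

minEig = min(eig(C));                     % a negative value means C is not positive semidefinite
fprintf('Smallest eigenvalue: %.4g\n', minEig);
```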
  2 Comments
Tiago Dias on 4 Jul 2018
Thanks for your input, but I need to impute the missing data. Since I have missing values (~30%, industrial data), I can calculate the covariance, but because the covariance contains NaNs, I can't calculate scores and loadings.
I have my matrix X and my imputed matrix X_imp (obtained with a PCA model, so all the entries are recalculated, even the non-missing values). Can I use
sum((X(i,j) - X_imp(i,j)).^2) as a criterion?
Jeff Miller on 5 Jul 2018
Sorry, I do not know whether your suggestion is reasonable or not.
If the data do not even allow the covariances to be estimated, then you probably don't have enough data to decide which is the best imputation method or to do PCA afterwards.
Can you select out a subset of the variables for which you can get a complete set of covariances? You might just do PCA on this subset.



Tiago Dias on 5 Jul 2018
I can't really make a subset, because all variables have missing data. But I found an article where they compute the residuals X (with MD) - Ximputed only for the entries (i,j) that are observed in X, so I will go that way.
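For anyone who wants to try the same thing, a minimal sketch of that residual calculation follows. The two small matrices are made-up numbers, and X_imp stands in for a model-based reconstruction in which the observed entries are also recalculated. For a method that leaves the observed entries untouched these residuals are all zero, so the criterion says how well the model reproduces the observed values rather than how accurate the filled-in values are.

```matlab
% Sketch: residual criterion restricted to the entries observed in X.
X     = [1.0 NaN 3.0; 2.0 5.0 NaN; NaN 4.0 6.0];  % hypothetical data, NaN = missing
X_imp = [1.1 4.4 2.8; 2.1 4.9 4.5; 1.6 4.2 6.3];  % hypothetical model reconstruction

observed = ~isnan(X);                     % logical mask of the observed entries
res  = X(observed) - X_imp(observed);     % residuals on observed entries only
sse  = sum(res.^2);                       % sum-of-squares criterion
rmse = sqrt(mean(res.^2));                % same information on the scale of the data
fprintf('SSE = %.4f, RMSE = %.4f\n', sse, rmse);
```

If the variables have very different scales, standardizing the columns before computing the residuals keeps one variable from dominating the sum.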
