Right measure for logistic regression of imbalanced data and dealing with complete separation
I have a highly imbalanced data set (ratio 1:150) with four predictors, two of which are correlated. The data are attached as data.m, and the two figures are shown below.
I would like to use logistic regression, and then validate it, in order to
- to compare it with a different model,
- check which predictors can be omitted
- check if the performance can be improved by combining features (feat1, feat1*feat2, etc.).
I would also like to undersample the data to reduce the computational effort (I want to use the classifier in a live application).
- Which measure should I use to check performance? There are many candidates (F-measure, Cohen's kappa, Powers' informedness, AUC of the ROC curve). I first thought of the AUC, because it does not require selecting a threshold, unlike the other measures. Or is it better to use the sum of squared errors, (predicted label - continuous classifier output)^2?
- How would you reduce the computational effort? I thought about focused undersampling instead of random undersampling, keeping the points where the classes overlap. But I suspect this might introduce bias.
- To deal with the separation there is Firth penalized logistic regression as in Heinze2002 and Bayesian logistic regression as in Gelman2008. Both are implemented in R ( logistf and bayesglm ), with which I'm not familiar. How can I deal with complete separation in Matlab? I tried to implement
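On the first question, the two threshold-free candidates can be compared directly on the same scores. The following is a minimal sketch in plain Python (the arithmetic ports directly to MATLAB); the function names `roc_auc` and `brier_score` are hypothetical helpers, not part of any toolbox. It assumes the squared-error measure is taken against the true label, which is the usual definition of the Brier score, and it computes the AUC from the Mann-Whitney rank statistic.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Computed from the Mann-Whitney U statistic; tied scores get average ranks.
    """
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over the run of tied scores that starts at position i.
        while j + 1 < n and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one example of each class")
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    u = rank_sum_pos - n_pos * (n_pos + 1) / 2.0
    return u / (n_pos * n_neg)


def brier_score(labels, probs):
    """Mean squared difference between predicted probability and true label."""
    return sum((p - y) ** 2 for y, p in zip(labels, probs)) / len(labels)


# Illustration with the 1:150 ratio from the question: a classifier that
# always outputs 0 is right 150 times out of 151, yet it cannot rank at all.
labels = [1] + [0] * 150
constant_scores = [0.0] * 151
accuracy = sum(1 for y in labels if y == 0) / len(labels)
auc_constant = roc_auc(labels, constant_scores)
```

Here `accuracy` is about 0.993 while `auc_constant` is 0.5, which is why accuracy-like measures are hard to interpret at this class ratio. The AUC only measures ranking, whereas the Brier score also penalizes badly calibrated probabilities, so the two can disagree.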
Figure 1. Two features plotted against each other for the full data set:
Figure 2. Random undersampled data, leading to complete separation:
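To illustrate why complete separation is a problem and what the Firth penalty changes, here is a minimal sketch in plain Python under strong simplifying assumptions: a single predictor, no intercept, and a made-up four-point data set that is perfectly separated. The function `fit_slope` is a hypothetical helper written for this illustration and is not the algorithm used by any particular package. It uses Fisher scoring, and with `firth=True` it adds the Firth term h_i * (0.5 - p_i) to the score, where h_i is the diagonal of the hat matrix.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_slope(x, y, firth, max_iter=25, tol=1e-10):
    """Fit p = sigmoid(w * x) by Fisher scoring (one parameter, no intercept).

    With firth=True the score is modified by the Firth penalty:
    U*(w) = sum((y_i - p_i + h_i * (0.5 - p_i)) * x_i).
    """
    w = 0.0
    for _ in range(max_iter):
        p = [sigmoid(w * xi) for xi in x]
        var = [pi * (1.0 - pi) for pi in p]  # Bernoulli variances
        info = sum(vi * xi * xi for vi, xi in zip(var, x))  # Fisher information
        if info <= 0.0:
            break  # information underflowed; no further step is possible
        score = 0.0
        for xi, yi, pi, vi in zip(x, y, p, var):
            # Hat-matrix diagonal for the one-parameter model (0 disables Firth).
            h = vi * xi * xi / info if firth else 0.0
            score += (yi - pi + h * (0.5 - pi)) * xi
        step = score / info
        w += step
        if abs(step) < tol:
            break
    return w


# Made-up, completely separated data: every x < 0 is class 0, every x > 0 is class 1.
x = [-2.0, -1.0, 1.0, 2.0]
y = [0, 0, 1, 1]

w_ml_10 = fit_slope(x, y, firth=False, max_iter=10)  # plain maximum likelihood
w_ml_25 = fit_slope(x, y, firth=False, max_iter=25)  # same fit, more iterations
w_firth = fit_slope(x, y, firth=True, max_iter=100)  # Firth-penalized fit
```

The plain maximum-likelihood slope keeps growing as more iterations are allowed (`w_ml_25` is larger than `w_ml_10`), because the likelihood has no finite maximum under separation. The Firth-penalized slope converges to a finite value. A full implementation would handle several predictors, an intercept, and step-size control.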
Ive J on 4 Jun 2022
It's probably a bit late for your original problem, but since this is an important question and MATLAB still lacks these features, note that this GitHub repo implements various penalized logistic regression methods (and much more):