How to deal with multicollinearity in Logistic Regression and LDA?

12 views (last 30 days)
Hi everybody! I have to apply a multinomial logistic regression classifier and an LDA classifier to a dataset. However, the dataset contains two columns that are linearly dependent because they are both composed entirely of zeros. As a consequence, the function B = mnrfit(dataset, classes) that I used returns a matrix B full of NaN elements, because the dataset matrix is not invertible (it is not full rank).
My idea was to overcome the multicollinearity by deleting the linearly dependent columns, thereby making the dataset an invertible matrix. However, this could lead to a loss of information and to an overfitted model.
Is there any other way to overcome the multicollinearity of a dataset and still apply logistic regression?
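For reference, the deletion approach I had in mind looks like this (a minimal sketch; dataset and classes are as in the mnrfit call above):
% Drop the constant (zero-variance) columns, i.e. the all-zero ones,
% so that the design matrix becomes full rank before calling mnrfit.
keep = var(dataset) > 0;                 % logical mask of non-constant columns
B = mnrfit(dataset(:, keep), classes);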
I have the same problem with LDA; in particular, I used the following code:
% Fisher LDA: generalized eigendecomposition of between- vs. within-class scatter
LDAmodel = fitcdiscr(X, classes, 'DiscrimType', 'pseudolinear');
[W, LAMBDA] = eig(LDAmodel.BetweenSigma, LDAmodel.Sigma);
lambda = diag(LAMBDA);
[lambda, SortOrder] = sort(lambda, 'descend');   % most discriminative directions first
W = W(:, SortOrder);
Z = X * W;                                       % project data onto the LDA directions
model = fitcnb(Z, classes);                      % naive Bayes on the projected data
With the 'pseudolinear' discriminant type, LDA works even on a non-invertible covariance matrix, but when I use the function fitcnb, MATLAB returns the following errors:
Error using ClassificationNaiveBayes/findNoDataCombos
A normal distribution cannot be fit for the combination of class 0 and predictor x1. The data has zero
variance.
Error in ClassificationNaiveBayes/fitNonMNDists (line 320)
distParams = findNoDataCombos(this);
Error in ClassificationNaiveBayes (line 108)
this.DistributionParameters = fitNonMNDists(this);
Error in classreg.learning.FitTemplate/fit (line 291)
[varargout{1:nargout}] = this.MakeFitObject(X,Y,W,this.ModelParams,fitArgs{:});
Error in ClassificationNaiveBayes.fit (line 222)
this = fit(temp,X,Y);
Error in fitcnb (line 250)
this = ClassificationNaiveBayes.fit(X,Y,RemainingArgs{:});
Error in Script (line 137)
scalarmodel=fitcnb(Ztr,Wtr);
How can I overcome this problem with LDA?
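(From the error message, it seems that some columns of Z have zero variance within at least one class, so fitcnb cannot fit a normal distribution there. This is a minimal workaround I sketched, though I am not sure it is the right fix:)
% Keep only projected directions along which every class has nonzero
% variance, since fitcnb cannot fit a normal distribution otherwise.
keep = true(1, size(Z, 2));
labels = unique(classes);
for k = 1:numel(labels)
    keep = keep & var(Z(classes == labels(k), :), 0, 1) > 0;
end
model = fitcnb(Z(:, keep), classes);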
  3 Comments
Picasso on 1 Feb 2023
And what if the linearly dependent columns are not composed only of zeros, but actually contain data? In that case, would the model be overfitted if I deleted these columns, or could it still work?
the cyclist on 1 Feb 2023
What you are describing is a symptom of a larger problem. If columns are linearly dependent, then that is an extreme case of the columns being correlated -- and that correlation violates an assumption of logistic regression.
In real-life problems, some correlation is inevitable. But larger correlations mean that you are not going to be able to interpret the coefficients. (Full linear dependence is the extreme case of that.)
Depending on what the overall correlation structure of all your explanatory variables looks like, you probably want to do some pre-processing steps. For example, you might want to use PCA or factor analysis on your explanatory variables, to reduce them to a set of variables that are completely (or at least mostly) uncorrelated.
If you don't really care about the interpretation of the coefficients of the regression, and only care about predictive power, you might instead want to move toward a full-blown machine learning model (which typically will not have any assumptions about multicollinearity).
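For example, a minimal sketch of that pre-processing step (assuming X and classes are the predictor matrix and labels from your code; the tolerance 1e-10 is an arbitrary choice):
% PCA on the predictors; the components with (near-)zero variance are
% exactly the directions responsible for the rank deficiency.
[~, score, latent] = pca(X);           % latent = variance of each component
keep = latent > 1e-10;                 % retain components with nonzero variance
B = mnrfit(score(:, keep), classes);   % full-rank design, no NaN coefficients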


Answers (1)

Yoga on 10 Mar 2023
There are several ways to overcome multicollinearity when applying logistic regression to a dataset. Here are a few approaches:
  1. Feature selection: One way to reduce multicollinearity is to select only the most important features for the logistic regression model. This can be done using techniques such as correlation analysis, feature importance ranking, or domain expertise.
  2. Regularization: Regularization techniques such as ridge and lasso penalties can also be used to overcome multicollinearity. These methods add a penalty term to the logistic regression objective function, which helps to reduce the magnitude of the coefficients and avoid overfitting (see the sketch after this list).
  3. Principal component analysis (PCA): PCA is a dimensionality reduction technique that can be used to reduce the number of variables in the dataset while still capturing most of the variance in the data. By creating new variables that are linear combinations of the original variables, PCA can reduce the impact of multicollinearity on the logistic regression model.
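As a rough MATLAB sketch of the regularization approach (assuming a binary response y, since lassoglm supports the 'binomial' distribution; a multinomial problem would need e.g. a one-versus-rest scheme):
% Lasso-penalized logistic regression: the penalty shrinks the
% coefficients of collinear predictors instead of producing NaNs.
[B, FitInfo] = lassoglm(X, y, 'binomial', 'CV', 10);   % 10-fold cross-validated lambda path
idx = FitInfo.IndexMinDeviance;                        % lambda minimizing CV deviance
coef = [FitInfo.Intercept(idx); B(:, idx)];            % intercept plus slope coefficients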
