Matching feature ranking algorithm outputs in Classification Learner
Christopher McCausland
on 5 Nov 2023
Commented: Christopher McCausland
on 8 Nov 2023
Hi,
In R2023b, the feature selection tab of the Classification Learner app lets you generate feature ranks with five different algorithms: MRMR, Chi2, ReliefF, ANOVA, and Kruskal Wallis.
MRMR and Chi2 can be replicated with:
[idx,scores] = fscmrmr(randSamp(:,3:end),'Stage');
[idx,scores] = fscchi2(randSamp(:,3:end),'Stage');
Here randSamp is a table with some variables ignored at the start, and 'Stage' is the label of interest.
However, I cannot figure out how to replicate the same with ANOVA and KW, I have tried something like this:
[idx,scores] = anova1(table2array(randSamp(:,4:end))',categorical(randSamp.Stage(:)));
[idx,scores] = kruskalwallis(table2array(randSamp(:,4:end))',categorical(randSamp.Stage(:)));
While this does compute *something*, I have no idea what it is doing or how to get it to match what the Classification Learner app does. Can anyone shed some light on this?
Christopher
Accepted Answer
Drew
on 6 Nov 2023
The short answer is that, for some feature ranking techniques, there is some normalization of the features before the ranking. This is by design, since some feature ranking techniques are particularly sensitive to normalization. To see how Classification Learner is ranking the features, use the "Generate Function" button in Classification Learner to generate code to replicate the feature selection.
For example, take these steps to see some example generated code:
(1) t=readtable("fisheriris.csv");
(2) Start Classification Learner, load the Fisher iris data, and accept the defaults at session start
(3) Rank features with Kruskal-Wallis, choose keeping the top three features
(4) Train the default tree model
(5) In the Export area of the toolstrip, choose "Generate Function".
Below is a section of code from the function generated by Classification Learner. Notice the calls to "standardizeMissing" and "normalize" in the first two lines of (non-comment) code. These functions are also used in the later cross-validation part of the code. So, for each training fold (or for all of the training data for the final model), the "standardizeMissing" function and the default "zscore" method of the "normalize" function are being used before ranking the features. Note: The normalization used before feature ranking is independent of any normalization (or no normalization) used before model training.
% Feature Ranking and Selection
% Replace Inf/-Inf values with NaN to prepare data for normalization
predictors = standardizeMissing(predictors, {Inf, -Inf});
% Normalize data for feature ranking
predictorMatrix = normalize(predictors, "DataVariable", ~isCategoricalPredictor);
newPredictorMatrix = zeros(size(predictorMatrix));
for i = 1:size(predictorMatrix, 2)
    if isCategoricalPredictor(i)
        newPredictorMatrix(:,i) = grp2idx(predictorMatrix{:,i});
    else
        newPredictorMatrix(:,i) = predictorMatrix{:,i};
    end
end
predictorMatrix = newPredictorMatrix;
responseVector = grp2idx(response);
% Rank features using Kruskal Wallis algorithm
for i = 1:size(predictorMatrix, 2)
    pValues(i) = kruskalwallis(...
        predictorMatrix(:,i), ...
        responseVector, ...
        'off');
end
[~,featureIndex] = sort(-log(pValues), 'descend');
numFeaturesToKeep = 3;
includedPredictorNames = predictors.Properties.VariableNames(featureIndex(1:numFeaturesToKeep));
predictors = predictors(:,includedPredictorNames);
isCategoricalPredictor = isCategoricalPredictor(featureIndex(1:numFeaturesToKeep));
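The same pattern can be tried outside the app. Below is a minimal sketch, not the app's own code, that ranks the four Fisher iris predictors by calling the test once per predictor column and scoring each with -log(p), as the generated code above does for Kruskal-Wallis. The ANOVA part is an assumption: it supposes the app uses anova1 in the same per-column way, which you can confirm by generating the function with ANOVA selected. The variable names pKW, pAnova, idxKW, and idxAnova are illustrative only.

```matlab
% Sketch: per-feature Kruskal-Wallis and one-way ANOVA ranking on Fisher iris.
% Requires Statistics and Machine Learning Toolbox.
load fisheriris                      % provides meas (150x4) and species (150x1 cell)

% z-score each predictor column (default method of normalize)
predictorMatrix = normalize(meas);
% Convert the class labels to integer group indices
responseVector = grp2idx(species);

numFeatures = size(predictorMatrix, 2);
pKW = zeros(1, numFeatures);
pAnova = zeros(1, numFeatures);
for i = 1:numFeatures
    % Each call tests one predictor against the class labels; 'off' suppresses figures
    pKW(i) = kruskalwallis(predictorMatrix(:,i), responseVector, 'off');
    % Assumption: ANOVA ranking follows the same per-column pattern
    pAnova(i) = anova1(predictorMatrix(:,i), responseVector, 'off');
end

% Score is -log(p): a smaller p-value gives a larger score
scoresKW = -log(pKW);
scoresAnova = -log(pAnova);
[~, idxKW] = sort(scoresKW, 'descend');
[~, idxAnova] = sort(scoresAnova, 'descend');
```

Note that the first output of both kruskalwallis and anova1 is a p-value, not an index, so a call of the form [idx,scores] = anova1(...) does not return a ranking. The ranking comes from sorting -log(p) across the predictors.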
If this answer helps you, please remember to accept the answer.