Why does sequentialfs always outperform cross-validation with selected features?

Why does the classification accuracy reported by sequentialfs (which itself uses cross-validation) always exceed the accuracy from a 10-fold cross-validation using the selected features? Any help would be gratefully received!
Thanks in advance.
Barry
See the code below: Acc_fs (77%) is always higher than Acc (67%). This finding holds true across multiple tests - the accuracy obtained using sequentialfs always outperforms the cross-validated accuracy. Is this a bug in my implementation or an issue with sequentialfs.m?
%************** Perform feature selection ************
c = cvpartition(Labels,'k',num_folds);  % stratified k-fold partition
opts = statset('display','iter');
fun = @(x_train,y_train,x_test,y_test)SVM_class_fun(x_train,y_train,x_test,y_test,kernel,rbf_sigma,boxconstraint);
[fs,history] = sequentialfs(fun,Data,Labels,'cv',c,'options',opts);
Acc_fs = 1 - history.Crit(end);  % accuracy reported by sequentialfs for the final feature set
%******* Cross-validated classification accuracy *******
Feature_select = find(fs==1);  % indices of selected features
Vars_select = Variables(fs==1);  % variable names of selected features
indices = crossvalind('Kfold',Labels,num_folds);
Results = classperf(Labels,'Positive',1,'Negative',0);  % initialize performance object
for i = 1:num_folds
    test = (indices == i); train = ~test;
    svmStruct = svmtrain(Data(train,Feature_select),Labels(train),'Kernel_Function','rbf','rbf_sigma',rbf_sigma,'boxconstraint',boxconstraint);
    class = svmclassify(svmStruct,Data(test,Feature_select));
    classperf(Results,class,test);  % accumulate results for this fold
end
Acc = Results.CorrectRate;  % cross-validated classification accuracy
Function SVM_class_fun returns the number of misclassified samples; sequentialfs sums these over the test folds and divides by the total number of test observations, so history.Crit holds the cross-validated misclassification rate:
function MCE = SVM_class_fun(x_train,y_train,x_test,y_test,kernel,rbf_sigma,boxconstraint)
% Train an SVM on the training fold, count misclassifications on the test fold
svmStruct = svmtrain(x_train,y_train,'Kernel_Function',kernel,'rbf_sigma',rbf_sigma,'boxconstraint',boxconstraint); % use the kernel argument rather than hardcoding 'rbf' ('rbf_sigma' only applies when kernel is 'rbf')
y_fit = svmclassify(svmStruct,x_test);
C = confusionmat(y_test,y_fit);  % confusion matrix on the test fold
N = sum(sum(C));                 % total number of test samples
MCE = N - sum(diag(C));          % number of misclassified samples
end

Accepted Answer

Ilya on 24 Jan 2012
I don't know whether your code is correct, but accuracy estimates obtained by sequential feature selection are always biased high.
Consider, say, 10 random variables with identical distributions, and suppose you want to find the one with the largest true mean. Generate a separate sample for each variable. Because the samples are finite, their estimated means will not be equal, so you choose the variable whose sample average comes out largest and believe it has the largest true mean. But all you did was pick the variable whose estimate came out largest by chance, and since that estimate is the largest of the ten, it most likely lies above its true mean. If you then generate a fresh sample for the chosen variable, the new estimate will tend to be lower than the one you selected on. Sequential feature selection does the same thing at every step: it keeps the feature whose cross-validated accuracy happened to come out highest.
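A quick Monte Carlo sketch of this effect (the variable count, sample size and trial count below are arbitrary, chosen only for illustration):
% Ten variables with identical true mean 0; pick the one whose sample
% mean comes out largest, then re-estimate that mean on fresh data.
rng(0);  % for reproducibility
nVars = 10; nObs = 50; nTrials = 1000;
selEst = zeros(nTrials,1);    % estimate used to select the "best" variable
freshEst = zeros(nTrials,1);  % honest re-estimate on new data
for t = 1:nTrials
    sample = randn(nObs,nVars);         % all true means are exactly 0
    selEst(t) = max(mean(sample,1));    % winner's sample mean
    freshEst(t) = mean(randn(nObs,1));  % fresh sample for the winner
end
fprintf('Mean at selection: %.3f; on fresh data: %.3f\n', ...
    mean(selEst), mean(freshEst))
The selection-time estimate comes out around +0.2 on average even though every true mean is 0, while the fresh-data estimate stays near 0 - the same optimism that inflates Acc_fs.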
This is exactly why you need to re-estimate the accuracy by another run of cross-validation after selection is done.
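One way to keep the selection step itself from contaminating the estimate - this goes a bit beyond what the answer spells out, so treat it as an illustrative sketch rather than Ilya's prescription - is to nest sequentialfs inside an outer cross-validation loop, reusing fun, Data, Labels, num_folds, rbf_sigma and boxconstraint from the question:
% Nested cross-validation: feature selection runs on each outer training
% fold only, so the outer test fold never influences which features win.
outer = cvpartition(Labels,'k',num_folds);
nWrong = 0;
for i = 1:num_folds
    trIdx = training(outer,i); teIdx = ~trIdx;          % outer train/test split
    inner = cvpartition(Labels(trIdx),'k',num_folds);   % inner CV drives selection
    fs_i = sequentialfs(fun,Data(trIdx,:),Labels(trIdx),'cv',inner);
    svmStruct = svmtrain(Data(trIdx,fs_i),Labels(trIdx),'Kernel_Function','rbf','rbf_sigma',rbf_sigma,'boxconstraint',boxconstraint);
    pred = svmclassify(svmStruct,Data(teIdx,fs_i));
    nWrong = nWrong + sum(pred ~= Labels(teIdx));       % errors on the held-out fold
end
Acc_nested = 1 - nWrong/numel(Labels);  % estimate untouched by selection bias
Each outer fold may select a different feature subset; Acc_nested estimates the accuracy of the whole pipeline (selection plus training), which is what gets applied to genuinely new data.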
