crossval

Cross-validate multiclass error-correcting output codes (ECOC) model

Syntax

CVMdl = crossval(Mdl)
CVMdl = crossval(Mdl,Name,Value)

Description


CVMdl = crossval(Mdl) returns a cross-validated (partitioned) multiclass error-correcting output codes (ECOC) model (CVMdl) from a trained ECOC model (Mdl). By default, crossval uses 10-fold cross-validation on the training data to create CVMdl, a ClassificationPartitionedECOC model.


CVMdl = crossval(Mdl,Name,Value) returns a partitioned ECOC model with additional options specified by one or more name-value pair arguments. For example, you can specify the number of folds or a holdout sample proportion.

Examples


Cross-validate an ECOC classifier with SVM binary learners, and estimate the generalized classification error.

Load Fisher's iris data set. Specify the predictor data X and the response data Y.

load fisheriris
X = meas;
Y = species;
rng(1); % For reproducibility

Create an SVM template, and standardize the predictors.

t = templateSVM('Standardize',true)
t = 
Fit template for classification SVM.

                     Alpha: [0x1 double]
             BoxConstraint: []
                 CacheSize: []
             CachingMethod: ''
                ClipAlphas: []
    DeltaGradientTolerance: []
                   Epsilon: []
              GapTolerance: []
              KKTTolerance: []
            IterationLimit: []
            KernelFunction: ''
               KernelScale: []
              KernelOffset: []
     KernelPolynomialOrder: []
                  NumPrint: []
                        Nu: []
           OutlierFraction: []
          RemoveDuplicates: []
           ShrinkagePeriod: []
                    Solver: ''
           StandardizeData: 1
        SaveSupportVectors: []
            VerbosityLevel: []
                   Version: 2
                    Method: 'SVM'
                      Type: 'classification'

t is an SVM template. Most of the template object properties are empty. When training the ECOC classifier, the software sets the applicable properties to their default values.

Train the ECOC classifier, and specify the class order.

Mdl = fitcecoc(X,Y,'Learners',t,...
    'ClassNames',{'setosa','versicolor','virginica'});

Mdl is a ClassificationECOC classifier. You can access its properties using dot notation.

Cross-validate Mdl using 10-fold cross-validation.

CVMdl = crossval(Mdl);

CVMdl is a ClassificationPartitionedECOC cross-validated ECOC classifier.

Estimate the generalized classification error.

genError = kfoldLoss(CVMdl)
genError = 0.0400

The generalized classification error is 4%, which indicates that the ECOC classifier generalizes fairly well.

Consider the arrhythmia data set. This data set contains 16 classes, 13 of which are represented in the data. The first class indicates that the subject does not have arrhythmia, and the last class indicates that the arrhythmia state of the subject is not recorded. The other classes are ordinal levels indicating the severity of arrhythmia.

Train an ECOC classifier with a custom coding design specified by the description of the classes.

Load the arrhythmia data set. Convert Y to a categorical variable, and determine the number of classes.

load arrhythmia
Y = categorical(Y);
K = numel(unique(Y)); % Number of distinct classes

Construct a coding matrix that describes the nature of the classes.

OrdMat = designecoc(11,'ordinal');
nOrdMat = size(OrdMat);
class1VSOrd = [1; -ones(11,1); 0];
class1VSClass16 = [1; zeros(11,1); -1];
OrdVSClass16 = [0; ones(11,1); -1];
Coding = [class1VSOrd class1VSClass16 OrdVSClass16,...
    [zeros(1,nOrdMat(2)); OrdMat; zeros(1,nOrdMat(2))]];

Train an ECOC classifier using the custom coding design (Coding) and parallel computing. Specify an ensemble of 50 classification trees boosted using GentleBoost.

t = templateEnsemble('GentleBoost',50,'Tree');
options = statset('UseParallel',true);
Mdl = fitcecoc(X,Y,'Coding',Coding,'Learners',t,'Options',options);
Starting parallel pool (parpool) using the 'local' profile ...
Connected to the parallel pool (number of workers: 6).

Mdl is a ClassificationECOC model. You can access its properties using dot notation.

Cross-validate Mdl using 8-fold cross-validation and parallel computing.

rng(1); % For reproducibility
CVMdl = crossval(Mdl,'Options',options,'KFold',8);
Warning: One or more folds do not contain points from all the groups.

Because some classes have low relative frequency, some folds do not train using observations from those classes. CVMdl is a ClassificationPartitionedECOC cross-validated ECOC model.

Estimate the generalization error using parallel computing.

error = kfoldLoss(CVMdl,'Options',options)
error = 0.3208

The cross-validated classification error is 32%, which indicates that this model does not generalize well. To improve the model, try training using a different boosting method, such as RobustBoost, or a different algorithm, such as SVM.

Input Arguments


Mdl — Full, trained multiclass ECOC model

Full, trained multiclass ECOC model, specified as a ClassificationECOC model trained with fitcecoc.

Name-Value Pair Arguments

Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside quotes. You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.

Example: crossval(Mdl,'KFold',3) specifies using three folds in a cross-validated model.

CVPartition — Cross-validation partition

Cross-validation partition, specified as the comma-separated pair consisting of 'CVPartition' and a cvpartition partition object created by cvpartition. The partition object specifies the type of cross-validation and the indexing for the training and validation sets.

To create a cross-validated model, you can use one of these four name-value pair arguments only: CVPartition, Holdout, KFold, or Leaveout.

Example: Suppose you create a random partition for 5-fold cross-validation on 500 observations by using cvp = cvpartition(500,'KFold',5). Then, you can specify the cross-validated model by using 'CVPartition',cvp.
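The example above can be written out as a short sketch; the fisheriris data and the 5-fold partition size are illustrative choices, not requirements. A fixed partition is useful when you want to compare several models on exactly the same folds:

```matlab
% Sketch: cross-validate with an explicit partition so that the same
% folds can be reused across models for a fair comparison.
load fisheriris                               % 150 observations
Mdl = fitcecoc(meas,species);                 % trained ECOC model
cvp = cvpartition(numel(species),'KFold',5);  % random 5-fold partition
CVMdl = crossval(Mdl,'CVPartition',cvp);      % uses the folds defined by cvp
```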

Holdout — Fraction of the data used for holdout validation

Fraction of the data used for holdout validation, specified as the comma-separated pair consisting of 'Holdout' and a scalar value in the range (0,1). If you specify 'Holdout',p, then the software completes these steps:

  1. Randomly select and reserve p*100% of the data as validation data, and train the model using the rest of the data.

  2. Store the compact, trained model in the Trained property of the cross-validated model.

To create a cross-validated model, you can use one of these four name-value pair arguments only: CVPartition, Holdout, KFold, or Leaveout.

Example: 'Holdout',0.1

Data Types: double | single
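The two steps above can be sketched as follows; the fisheriris data and the 0.3 holdout proportion are illustrative choices:

```matlab
load fisheriris
Mdl = fitcecoc(meas,species);
CVMdl = crossval(Mdl,'Holdout',0.3); % train on 70%, reserve 30% for validation
CompactMdl = CVMdl.Trained{1};       % the single compact model from step 2
```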

KFold — Number of folds

Number of folds to use in a cross-validated model, specified as the comma-separated pair consisting of 'KFold' and a positive integer value greater than 1. If you specify 'KFold',k, then the software completes these steps:

  1. Randomly partition the data into k sets.

  2. For each set, reserve the set as validation data, and train the model using the other k – 1 sets.

  3. Store the k compact, trained models in the cells of a k-by-1 cell vector in the Trained property of the cross-validated model.

To create a cross-validated model, you can use one of these four name-value pair arguments only: CVPartition, Holdout, KFold, or Leaveout.

Example: 'KFold',5

Data Types: single | double
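A minimal sketch of the three steps above, again using fisheriris for illustration; note that the Trained property holds one compact model per fold:

```matlab
load fisheriris
Mdl = fitcecoc(meas,species);
CVMdl = crossval(Mdl,'KFold',5);
size(CVMdl.Trained) % 5-by-1 cell array of compact, trained ECOC models
```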

Leaveout — Leave-one-out cross-validation flag

Leave-one-out cross-validation flag, specified as the comma-separated pair consisting of 'Leaveout' and 'on' or 'off'. If you specify 'Leaveout','on', then, for each of the n observations (where n is the number of observations excluding missing observations, specified in the NumObservations property of the model), the software completes these steps:

  1. Reserve the observation as validation data, and train the model using the other n – 1 observations.

  2. Store the n compact, trained models in the cells of an n-by-1 cell vector in the Trained property of the cross-validated model.

To create a cross-validated model, you can use one of these four name-value pair arguments only: CVPartition, Holdout, KFold, or Leaveout.

Example: 'Leaveout','on'

Options — Estimation options

Estimation options, specified as the comma-separated pair consisting of 'Options' and a structure array returned by statset.

To invoke parallel computing:

  • You need a Parallel Computing Toolbox™ license.

  • Specify 'Options',statset('UseParallel',true).

Tips

  • Assess the predictive performance of Mdl on cross-validated data using the "kfold" methods and properties of CVMdl, such as kfoldLoss.
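As a sketch of this workflow (fisheriris is used here only for illustration), the "kfold" methods operate on the out-of-fold observations of each trained fold:

```matlab
load fisheriris
Mdl = fitcecoc(meas,species);
CVMdl = crossval(Mdl);        % 10-fold by default
L = kfoldLoss(CVMdl);         % average out-of-fold classification error
labels = kfoldPredict(CVMdl); % out-of-fold predicted label for each observation
```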

Alternative Functionality

Instead of training an ECOC model and then cross-validating it, you can create a cross-validated ECOC model directly by using fitcecoc and specifying one of these name-value pair arguments: 'CrossVal', 'CVPartition', 'Holdout', 'Leaveout', or 'KFold'.
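For example, a single call to fitcecoc replaces the train-then-crossval sequence; the fisheriris data is illustrative:

```matlab
% One call instead of fitcecoc followed by crossval:
load fisheriris
CVMdl = fitcecoc(meas,species,'KFold',10); % ClassificationPartitionedECOC model
```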

Extended Capabilities

Automatic Parallel Support: To run in parallel, specify the 'Options' name-value pair argument in the call to this function, and set the 'UseParallel' field of the options structure to true. This option requires Parallel Computing Toolbox™.

Introduced in R2014b