# crossval

Cross-validate machine learning model

## Description

`CVMdl = crossval(Mdl,Name=Value)` specifies additional options using one or more name-value arguments. For example, you can specify the fraction of data for holdout validation or the number of folds to use in the cross-validated model.

## Examples

### Cross-Validate SVM Classifier

Load the `ionosphere` data set. This data set has 34 predictors and 351 binary responses for radar returns, either bad (`'b'`) or good (`'g'`).

load ionosphere

rng(1); % For reproducibility

Train a support vector machine (SVM) classifier. Standardize the predictor data and specify the order of the classes.

SVMModel = fitcsvm(X,Y,'Standardize',true,'ClassNames',{'b','g'});

`SVMModel` is a trained `ClassificationSVM` classifier. `'b'` is the negative class and `'g'` is the positive class.

Cross-validate the classifier using 10-fold cross-validation.

CVSVMModel = crossval(SVMModel)

CVSVMModel = ClassificationPartitionedModel CrossValidatedModel: 'SVM' PredictorNames: {'x1' 'x2' 'x3' 'x4' 'x5' 'x6' 'x7' 'x8' 'x9' 'x10' 'x11' 'x12' 'x13' 'x14' 'x15' 'x16' 'x17' 'x18' 'x19' 'x20' 'x21' 'x22' 'x23' 'x24' 'x25' 'x26' 'x27' 'x28' 'x29' 'x30' 'x31' 'x32' 'x33' 'x34'} ResponseName: 'Y' NumObservations: 351 KFold: 10 Partition: [1x1 cvpartition] ClassNames: {'b' 'g'} ScoreTransform: 'none'

`CVSVMModel` is a `ClassificationPartitionedModel` cross-validated classifier. During cross-validation, the software completes these steps:

1. Randomly partition the data into 10 sets of equal size.
2. Train an SVM classifier on nine of the sets.
3. Repeat steps 1 and 2 *k* = 10 times. The software leaves out one partition each time and trains on the other nine partitions.
4. Combine generalization statistics for each fold.
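The k-fold procedure described here can be sketched by hand with a `cvpartition` object. This is only an illustration, not how `crossval` is implemented internally; the variable names (`cvp`, `foldLoss`, `manualLoss`) are chosen for this sketch, and the result will generally differ from the `crossval` result because the random partition differs.

```matlab
load ionosphere
rng(1); % For reproducibility

k = 10;
cvp = cvpartition(Y,'KFold',k);   % random partition of the observations into k folds
foldLoss = zeros(k,1);
for i = 1:k
    trainIdx = training(cvp,i);   % logical index of the k-1 training folds
    testIdx = test(cvp,i);        % logical index of the held-out fold
    foldMdl = fitcsvm(X(trainIdx,:),Y(trainIdx), ...
        'Standardize',true,'ClassNames',{'b','g'});
    foldLoss(i) = loss(foldMdl,X(testIdx,:),Y(testIdx), ...
        'LossFun','classiferror');
end
% Combine the per-fold errors, weighting each fold by its size
manualLoss = sum(foldLoss .* cvp.TestSize(:)) / sum(cvp.TestSize)
```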

Display the first model in `CVSVMModel.Trained`.

FirstModel = CVSVMModel.Trained{1}

FirstModel = CompactClassificationSVM ResponseName: 'Y' CategoricalPredictors: [] ClassNames: {'b' 'g'} ScoreTransform: 'none' Alpha: [78x1 double] Bias: -0.2209 KernelParameters: [1x1 struct] Mu: [0.8888 0 0.6320 0.0406 0.5931 0.1205 0.5361 0.1286 0.5083 0.1879 0.4779 0.1567 0.3924 0.0875 0.3360 0.0789 0.3839 9.6066e-05 0.3562 -0.0308 0.3398 -0.0073 0.3590 -0.0628 0.4064 -0.0664 0.5535 -0.0749 0.3835 ... ] (1x34 double) Sigma: [0.3149 0 0.5033 0.4441 0.5255 0.4663 0.4987 0.5205 0.5040 0.4780 0.5649 0.4896 0.6293 0.4924 0.6606 0.4535 0.6133 0.4878 0.6250 0.5140 0.6075 0.5150 0.6068 0.5222 0.5729 0.5103 0.5061 0.5478 0.5712 0.5032 ... ] (1x34 double) SupportVectors: [78x34 double] SupportVectorLabels: [78x1 double]

`FirstModel` is the first of the 10 trained classifiers. It is a `CompactClassificationSVM` classifier.

You can estimate the generalization error by passing `CVSVMModel` to `kfoldLoss`.
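A minimal sketch of that last step, repeating the setup from this example so that it runs on its own. The variable names `genError` and `foldErrors` are assumptions of this sketch; the `'Mode','individual'` option requests one loss value per fold instead of the average.

```matlab
load ionosphere
rng(1); % For reproducibility
SVMModel = fitcsvm(X,Y,'Standardize',true,'ClassNames',{'b','g'});
CVSVMModel = crossval(SVMModel);

% Average misclassification rate over the 10 validation folds
genError = kfoldLoss(CVSVMModel)

% One misclassification rate per validation fold
foldErrors = kfoldLoss(CVSVMModel,'Mode','individual')
```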

### Specify Holdout Sample Proportion for Naive Bayes Cross-Validation

Specify a holdout sample proportion for cross-validation. By default, `crossval` uses 10-fold cross-validation to cross-validate a naive Bayes classifier. However, you have several other options for cross-validation. For example, you can specify a different number of folds or a holdout sample proportion.

Load the `ionosphere` data set. This data set has 34 predictors and 351 binary responses for radar returns, either bad (`'b'`) or good (`'g'`).

`load ionosphere`

Remove the first two predictors for stability.

X = X(:,3:end); rng('default'); % For reproducibility

Train a naive Bayes classifier using the predictors `X` and class labels `Y`. A recommended practice is to specify the class names. `'b'` is the negative class and `'g'` is the positive class. `fitcnb` assumes that each predictor is conditionally and normally distributed.

Mdl = fitcnb(X,Y,'ClassNames',{'b','g'});

`Mdl` is a trained `ClassificationNaiveBayes` classifier.

Cross-validate the classifier by specifying a 30% holdout sample.

`CVMdl = crossval(Mdl,'Holdout',0.3)`

CVMdl = ClassificationPartitionedModel CrossValidatedModel: 'NaiveBayes' PredictorNames: {'x1' 'x2' 'x3' 'x4' 'x5' 'x6' 'x7' 'x8' 'x9' 'x10' 'x11' 'x12' 'x13' 'x14' 'x15' 'x16' 'x17' 'x18' 'x19' 'x20' 'x21' 'x22' 'x23' 'x24' 'x25' 'x26' 'x27' 'x28' 'x29' 'x30' 'x31' 'x32'} ResponseName: 'Y' NumObservations: 351 KFold: 1 Partition: [1x1 cvpartition] ClassNames: {'b' 'g'} ScoreTransform: 'none'

`CVMdl` is a `ClassificationPartitionedModel` cross-validated, naive Bayes classifier.

Display the properties of the classifier trained using 70% of the data.

TrainedModel = CVMdl.Trained{1}

TrainedModel = CompactClassificationNaiveBayes ResponseName: 'Y' CategoricalPredictors: [] ClassNames: {'b' 'g'} ScoreTransform: 'none' DistributionNames: {1x32 cell} DistributionParameters: {2x32 cell}

`TrainedModel` is a `CompactClassificationNaiveBayes` classifier.

Estimate the generalization error by passing `CVMdl` to `kfoldLoss`.

kfoldLoss(CVMdl)

ans = 0.2095

The out-of-sample misclassification error is approximately 21%.

Reduce the generalization error by choosing the five most important predictors.

idx = fscmrmr(X,Y); Xnew = X(:,idx(1:5));

Train a naive Bayes classifier using the new predictors.

Mdlnew = fitcnb(Xnew,Y,'ClassNames',{'b','g'});

Cross-validate the new classifier by specifying a 30% holdout sample, and estimate the generalization error.

```
CVMdlnew = crossval(Mdlnew,'Holdout',0.3);
kfoldLoss(CVMdlnew)
```

ans = 0.1429

The out-of-sample misclassification error is reduced from approximately 21% to approximately 14%.

### Create Cross-Validated Regression GAM Using `crossval`

Train a regression generalized additive model (GAM) by using `fitrgam`, and create a cross-validated GAM by using `crossval` and the holdout option. Then, use `kfoldPredict` to predict responses for validation-fold observations using a model trained on training-fold observations.

Load the `patients` data set.

`load patients`

Create a table that contains the predictor variables (`Age`, `Diastolic`, `Smoker`, `Weight`, `Gender`, `SelfAssessedHealthStatus`) and the response variable (`Systolic`).

tbl = table(Age,Diastolic,Smoker,Weight,Gender,SelfAssessedHealthStatus,Systolic);

Train a GAM that contains linear terms for predictors.

`Mdl = fitrgam(tbl,'Systolic');`

`Mdl` is a `RegressionGAM` model object.

Cross-validate the model by specifying a 30% holdout sample.

rng('default') % For reproducibility

CVMdl = crossval(Mdl,'Holdout',0.3)

CVMdl = RegressionPartitionedGAM CrossValidatedModel: 'GAM' PredictorNames: {'Age' 'Diastolic' 'Smoker' 'Weight' 'Gender' 'SelfAssessedHealthStatus'} CategoricalPredictors: [3 5 6] ResponseName: 'Systolic' NumObservations: 100 KFold: 1 Partition: [1x1 cvpartition] NumTrainedPerFold: [1x1 struct] ResponseTransform: 'none' IsStandardDeviationFit: 0

The `crossval` function creates a `RegressionPartitionedGAM` model object `CVMdl` with the holdout option. During cross-validation, the software completes these steps:

1. Randomly select and reserve 30% of the data as validation data, and train the model using the rest of the data.
2. Store the compact, trained model in the `Trained` property of the cross-validated model object `RegressionPartitionedGAM`.

You can choose a different cross-validation setting by using the `'CrossVal'`, `'CVPartition'`, `'KFold'`, or `'Leaveout'` name-value argument.

Predict responses for the validation-fold observations by using `kfoldPredict`. The function predicts responses for the validation-fold observations by using the model trained on the training-fold observations. The function assigns `NaN` to the training-fold observations.

yFit = kfoldPredict(CVMdl);

Find the validation-fold observation indexes, and create a table containing the observation index, observed response values, and predicted response values. Display the first eight rows of the table.

idx = find(~isnan(yFit));

t = table(idx,tbl.Systolic(idx),yFit(idx),'VariableNames',{'Observation Index','Observed Value','Predicted Value'});

head(t)

Observation Index | Observed Value | Predicted Value |
---|---|---|
1 | 124 | 130.22 |
6 | 121 | 124.38 |
7 | 130 | 125.26 |
12 | 115 | 117.05 |
20 | 125 | 121.82 |
22 | 123 | 116.99 |
23 | 114 | 107 |
24 | 128 | 122.52 |

Compute the regression error (mean squared error) for the validation-fold observations.

L = kfoldLoss(CVMdl)

L = 43.8715
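The mean squared error can also be worked out by hand from the validation-fold predictions, which makes the arithmetic behind `kfoldLoss` explicit. The sketch below repeats the setup so that it runs on its own; `residuals` and `manualMSE` are names introduced here, and the agreement with `kfoldLoss` assumes the default equal observation weights.

```matlab
load patients
tbl = table(Age,Diastolic,Smoker,Weight,Gender,SelfAssessedHealthStatus,Systolic);
Mdl = fitrgam(tbl,'Systolic');
rng('default') % For reproducibility
CVMdl = crossval(Mdl,'Holdout',0.3);

yFit = kfoldPredict(CVMdl);
idx = find(~isnan(yFit));                    % validation-fold observations only
residuals = tbl.Systolic(idx) - yFit(idx);   % observed minus predicted
manualMSE = mean(residuals.^2)               % compare with kfoldLoss(CVMdl)
```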

### Cross-Validate ECOC Classifier

Cross-validate an ECOC classifier with SVM binary learners, and estimate the generalized classification error.

Load Fisher's iris data set. Specify the predictor data `X` and the response data `Y`.

load fisheriris

X = meas;

Y = species;

rng(1); % For reproducibility

Create an SVM template, and standardize the predictors.

`t = templateSVM('Standardize',true)`

t = Fit template for SVM. Standardize: 1

`t` is an SVM template. Most of the template object properties are empty. When training the ECOC classifier, the software sets the applicable properties to their default values.

Train the ECOC classifier, and specify the class order.

Mdl = fitcecoc(X,Y,'Learners',t,'ClassNames',{'setosa','versicolor','virginica'});

`Mdl` is a `ClassificationECOC` classifier. You can access its properties using dot notation.

Cross-validate `Mdl` using 10-fold cross-validation.

CVMdl = crossval(Mdl);

`CVMdl` is a `ClassificationPartitionedECOC` cross-validated ECOC classifier.

Estimate the generalized classification error.

genError = kfoldLoss(CVMdl)

genError = 0.0400

The generalized classification error is 4%, which indicates that the ECOC classifier generalizes fairly well.
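To see where such an error rate comes from, one option is to tabulate the validation-fold predictions in a confusion matrix and count the off-diagonal entries. This is a sketch under the same setup as the example; `labels`, `C`, and `manualError` are names chosen here, and the match with `kfoldLoss` assumes the default classification-error loss with equal observation weights.

```matlab
load fisheriris
X = meas;
Y = species;
rng(1); % For reproducibility
t = templateSVM('Standardize',true);
classOrder = {'setosa','versicolor','virginica'};
Mdl = fitcecoc(X,Y,'Learners',t,'ClassNames',classOrder);
CVMdl = crossval(Mdl);

labels = kfoldPredict(CVMdl);                 % validation-fold predicted labels
C = confusionmat(Y,labels,'Order',classOrder) % rows: true class, columns: predicted
manualError = 1 - sum(diag(C))/sum(C(:))      % fraction of misclassified observations
```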

## Input Arguments

`Mdl`

— Machine learning model

full regression model object | full classification model object

Machine learning model, specified as a full regression or classification model object, as given in the following tables of supported models.

**Regression Model Object**

Model | Full Regression Model Object |
---|---|
Gaussian process regression (GPR) model | `RegressionGP` (If you supply a custom `ActiveSet` value in the call to `fitrgp`, then you cannot cross-validate the GPR model.) |
Generalized additive model (GAM) | `RegressionGAM` |
Neural network model | `RegressionNeuralNetwork` |
Regression ensemble model | `RegressionEnsemble` |
Support vector machine regression model | `RegressionSVM` |
Regression tree model | `RegressionTree` |

**Classification Model Object**

Model | Full Classification Model Object |
---|---|
Generalized additive model | `ClassificationGAM` |
k-nearest neighbor model | `ClassificationKNN` |
Naive Bayes model | `ClassificationNaiveBayes` |
Neural network model | `ClassificationNeuralNetwork` |
Support vector machine for one-class and binary classification | `ClassificationSVM` |
Discriminant analysis classifier | `ClassificationDiscriminant` |
Multiclass error-correcting output codes (ECOC) model | `ClassificationECOC` |
Ensemble classifier | `ClassificationEnsemble` |
Binary decision tree for multiclass classification | `ClassificationTree` |

### Name-Value Arguments

Specify optional pairs of arguments as `Name1=Value1,...,NameN=ValueN`, where `Name` is the argument name and `Value` is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

*Before R2021a, use commas to separate each name and value, and enclose* `Name` *in quotes.*

**Example:** `crossval(Mdl,KFold=3)` specifies to use three folds in the cross-validated model.

`CVPartition`

— Cross-validation partition

`[]` (default) | `cvpartition` object

Cross-validation partition, specified as a `cvpartition` object that specifies the type of cross-validation and the indexing for the training and validation sets.

To create a cross-validated model, you can specify only one of these four name-value arguments: `CVPartition`, `Holdout`, `KFold`, or `Leaveout`.

**Example:** Suppose you create a random partition for 5-fold cross-validation on 500 observations by using `cvp = cvpartition(500,KFold=5)`. Then, you can specify the cross-validation partition by setting `CVPartition=cvp`.
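As a small runnable variant of this example, the sketch below builds the partition from the number of observations in Fisher's iris data and passes it to `crossval`. The choice of a decision tree and the variable names (`n`, `numFolds`, `cvLoss`) are assumptions made for illustration.

```matlab
load fisheriris
rng(1); % For reproducibility
Mdl = fitctree(meas,species);

n = numel(species);
cvp = cvpartition(n,KFold=5);          % random 5-fold partition of n observations
CVMdl = crossval(Mdl,CVPartition=cvp);

numFolds = CVMdl.KFold                 % number of folds, taken from cvp
cvLoss = kfoldLoss(CVMdl)
```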

`Holdout`

— Fraction of data for holdout validation

scalar value in the range (0,1)

Fraction of the data used for holdout validation, specified as a scalar value in the range (0,1). If you specify `Holdout=p`, then the software completes these steps:

1. Randomly select and reserve `p*100`% of the data as validation data, and train the model using the rest of the data.
2. Store the compact trained model in the `Trained` property of the cross-validated model.

To create a cross-validated model, you can specify only one of these four name-value arguments: `CVPartition`, `Holdout`, `KFold`, or `Leaveout`.

**Example:** `Holdout=0.1`

**Data Types:** `double` | `single`
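As a rough illustration of the `Holdout` argument, the sketch below cross-validates a discriminant analysis classifier with a 20% holdout sample and inspects the result. The data set, the model type, and the names `numTrained` and `testFraction` are assumptions of this sketch.

```matlab
load fisheriris
rng(1); % For reproducibility
Mdl = fitcdiscr(meas,species);
CVMdl = crossval(Mdl,Holdout=0.2);

numTrained = numel(CVMdl.Trained)   % holdout validation trains a single model
% Fraction of observations reserved as validation data
testFraction = CVMdl.Partition.TestSize / CVMdl.NumObservations
```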

`KFold`

— Number of folds

`10` (default) | positive integer value greater than 1

Number of folds to use in the cross-validated model, specified as a positive integer value greater than 1. If you specify `KFold=k`, then the software completes these steps:

1. Randomly partition the data into `k` sets.
2. For each set, reserve the set as validation data, and train the model using the other `k` – 1 sets.
3. Store the `k` compact trained models in a `k`-by-1 cell vector in the `Trained` property of the cross-validated model.

To create a cross-validated model, you can specify only one of these four name-value arguments: `CVPartition`, `Holdout`, `KFold`, or `Leaveout`.

**Example:** `KFold=5`

**Data Types:** `single` | `double`
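A short sketch of the `KFold` argument, assuming a decision tree trained on Fisher's iris data. It checks that the `Trained` property holds one model per fold; the names `trainedSize` and `firstFoldModel` are introduced here for illustration.

```matlab
load fisheriris
rng(1); % For reproducibility
Mdl = fitctree(meas,species);
k = 5;
CVMdl = crossval(Mdl,KFold=k);

trainedSize = size(CVMdl.Trained)    % k-by-1 cell vector of trained models
firstFoldModel = CVMdl.Trained{1};
class(firstFoldModel)                % compact model trained on the other k-1 folds
```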

`Leaveout`

— Leave-one-out cross-validation flag

`"off"` (default) | `"on"`

Leave-one-out cross-validation flag, specified as `"on"` or `"off"`. If you specify `Leaveout="on"`, then for each of the *n* observations (where *n* is the number of observations, excluding missing observations, specified in the `NumObservations` property of the model), the software completes these steps:

1. Reserve the one observation as validation data, and train the model using the other *n* – 1 observations.
2. Store the *n* compact trained models in an *n*-by-1 cell vector in the `Trained` property of the cross-validated model.

To create a cross-validated model, you can specify only one of these four name-value arguments: `CVPartition`, `Holdout`, `KFold`, or `Leaveout`.

**Example:** `Leaveout="on"`

**Data Types:** `char` | `string`
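For a small data set, leave-one-out cross-validation can be tried directly. The sketch below assumes a discriminant analysis classifier on Fisher's iris data, which is fast to retrain once per observation; `numModels` and `looError` are names chosen for this sketch.

```matlab
load fisheriris
Mdl = fitcdiscr(meas,species);
CVMdl = crossval(Mdl,Leaveout="on");

n = Mdl.NumObservations;
numModels = numel(CVMdl.Trained)   % one trained model per observation
looError = kfoldLoss(CVMdl)        % leave-one-out misclassification rate
```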

`NPrint`

— Printout frequency

`"off"` (default) | positive integer

Printout frequency, specified as a positive integer or `"off"`.

To track the number of folds trained by the software so far, specify a positive integer *m*. The software displays a message to the command line every time it finishes training *m* folds.

If you specify `"off"`, the software does not display a message when it completes training folds.

**Note**

You can only specify `NPrint` if `Mdl` is a `ClassificationEnsemble` or `RegressionEnsemble` model object.

**Example:** `NPrint=5`

**Data Types:** `single` | `double` | `char` | `string`
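A hedged sketch of `NPrint` with a classification ensemble. The boosting method and the number of learning cycles are arbitrary choices for illustration, and the exact text of the progress messages is not shown here because it depends on the release.

```matlab
load ionosphere
rng(1); % For reproducibility
Mdl = fitcensemble(X,Y,'Method','AdaBoostM1','NumLearningCycles',20);

% Ask for a progress message after each trained fold
CVMdl = crossval(Mdl,KFold=5,NPrint=1);
cvLoss = kfoldLoss(CVMdl)
```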

`Options`

— Options for computing in parallel

structure

Options for computing in parallel, specified as a structure. Create the `Options` structure using `statset`.

You need Parallel Computing Toolbox™ to run computations in parallel.

You can only specify `Options` if `Mdl` is a `ClassificationECOC` model object.

**Example:** `Options=statset(UseParallel=true)`

**Data Types:** `struct`

## Output Arguments

`CVMdl`

— Cross-validated machine learning model

cross-validated (partitioned) model object

Cross-validated machine learning model, returned as one of the cross-validated (partitioned) model objects in the following tables, depending on the input model `Mdl`.

**Regression Model Object**

Model | Regression Model (`Mdl`) | Cross-Validated Model (`CVMdl`) |
---|---|---|
Gaussian process regression model | `RegressionGP` | `RegressionPartitionedGP` |
Generalized additive model | `RegressionGAM` | `RegressionPartitionedGAM` |
Neural network model | `RegressionNeuralNetwork` | `RegressionPartitionedNeuralNetwork` |
Regression ensemble model | `RegressionEnsemble` | `RegressionPartitionedEnsemble` |
Support vector machine regression model | `RegressionSVM` | `RegressionPartitionedSVM` |
Regression tree model | `RegressionTree` | `RegressionPartitionedModel` |

**Classification Model Object**

Model | Classification Model (`Mdl`) | Cross-Validated Model (`CVMdl`) |
---|---|---|
Generalized additive model | `ClassificationGAM` | `ClassificationPartitionedGAM` |
k-nearest neighbor model | `ClassificationKNN` | `ClassificationPartitionedModel` |
Naive Bayes model | `ClassificationNaiveBayes` | `ClassificationPartitionedModel` |
Neural network model | `ClassificationNeuralNetwork` | `ClassificationPartitionedModel` |
Support vector machine for one-class and binary classification | `ClassificationSVM` | `ClassificationPartitionedModel` |
Discriminant analysis classifier | `ClassificationDiscriminant` | `ClassificationPartitionedModel` |
Multiclass error-correcting output codes (ECOC) model | `ClassificationECOC` | `ClassificationPartitionedECOC` |
Ensemble classifier | `ClassificationEnsemble` | `ClassificationPartitionedEnsemble` |
Binary decision tree for multiclass classification | `ClassificationTree` | `ClassificationPartitionedModel` |

## Tips

- Assess the predictive performance of `Mdl` on cross-validated data using the "kfold" functions and properties of `CVMdl`, such as `kfoldPredict`, `kfoldLoss`, `kfoldMargin`, and `kfoldEdge` for classification and `kfoldPredict` and `kfoldLoss` for regression.
- Return a partitioned classifier with stratified partitioning by using the name-value argument `'KFold'` or `'Holdout'`.
- Create a `cvpartition` object `cvp` using `cvp = cvpartition(n,KFold=k)`. Return a partitioned classifier with nonstratified partitioning by using the name-value argument `'CVPartition',cvp`.
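The difference between stratified and nonstratified partitioning can be made visible by counting how many observations of one class land in each validation fold. The sketch below is an illustration on Fisher's iris data; the model type and the variable names (`CVStrat`, `CVNonstrat`, `stratCounts`, `nonstratCounts`) are assumptions of this sketch.

```matlab
load fisheriris
rng(1); % For reproducibility
Mdl = fitctree(meas,species);

% Stratified: crossval builds the partition using the class labels
CVStrat = crossval(Mdl,KFold=5);

% Nonstratified: the partition is built from the observation count only
cvp = cvpartition(numel(species),KFold=5);
CVNonstrat = crossval(Mdl,CVPartition=cvp);

% Count setosa observations in each validation fold
isSetosa = strcmp(species,'setosa');
stratCounts = zeros(1,5);
nonstratCounts = zeros(1,5);
for i = 1:5
    stratCounts(i) = sum(isSetosa(test(CVStrat.Partition,i)));
    nonstratCounts(i) = sum(isSetosa(test(cvp,i)));
end
stratCounts       % class counts are balanced across folds
nonstratCounts    % class counts can vary from fold to fold
```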

## Alternative Functionality

Instead of training a model and then cross-validating it, you can create a cross-validated model directly by using a fitting function and specifying one of these name-value pair arguments: `CVPartition`, `Holdout`, `KFold`, or `Leaveout`.
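For instance, a cross-validated SVM classifier can be requested in a single call to `fitcsvm`. This is a minimal sketch on the `ionosphere` data; the choice of five folds is arbitrary.

```matlab
load ionosphere
rng(1); % For reproducibility

% Train and cross-validate in one step by passing KFold to the fitting function
CVMdl = fitcsvm(X,Y,'Standardize',true,'ClassNames',{'b','g'},'KFold',5);
class(CVMdl)
cvLoss = kfoldLoss(CVMdl)
```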

## Extended Capabilities

### GPU Arrays

Accelerate code by running on a graphics processing unit (GPU) using Parallel Computing Toolbox™.

Usage notes and limitations:

This function fully supports GPU arrays for a trained classification model specified as a `ClassificationKNN` or `ClassificationSVM` object.

For more information, see Run MATLAB Functions on a GPU (Parallel Computing Toolbox).

## Version History

**Introduced in R2012a**

### R2023b: A cross-validated regression neural network model is a `RegressionPartitionedNeuralNetwork` object

Starting in R2023b, a cross-validated regression neural network model is a `RegressionPartitionedNeuralNetwork` object. In previous releases, a cross-validated regression neural network model was a `RegressionPartitionedModel` object.

You can create a `RegressionPartitionedNeuralNetwork` object in two ways:

- Create a cross-validated model from a regression neural network model object `RegressionNeuralNetwork` by using the `crossval` object function.
- Create a cross-validated model by using the `fitrnet` function and specifying one of the name-value arguments `CrossVal`, `CVPartition`, `Holdout`, `KFold`, or `Leaveout`.

### R2022b: A cross-validated Gaussian process regression model is a `RegressionPartitionedGP` object

Starting in R2022b, a cross-validated Gaussian process regression (GPR) model is a `RegressionPartitionedGP` object. In previous releases, a cross-validated GPR model was a `RegressionPartitionedModel` object.

You can create a `RegressionPartitionedGP` object in two ways:

- Create a cross-validated model from a GPR model object `RegressionGP` by using the `crossval` object function.
- Create a cross-validated model by using the `fitrgp` function and specifying one of the name-value arguments `CrossVal`, `CVPartition`, `Holdout`, `KFold`, or `Leaveout`.

Regardless of whether you train a full or cross-validated GPR model first, you cannot specify an `ActiveSet` value in the call to `fitrgp`.
