kfoldLoss
Loss for cross-validated partitioned regression model
Description
L = kfoldLoss(CVMdl,Name,Value) returns the loss with additional options specified by one or more name-value arguments. For example, you can specify a custom loss function.
Examples
Find Cross-Validation Loss for Regression Ensemble
Find the cross-validation loss for a regression ensemble of the carsmall data.
Load the carsmall data set and select displacement, horsepower, and vehicle weight as predictors.
load carsmall
X = [Displacement Horsepower Weight];
Train an ensemble of regression trees.
rens = fitrensemble(X,MPG);
Create a cross-validated ensemble from rens and find the k-fold cross-validation loss.
rng(10,'twister') % For reproducibility
cvrens = crossval(rens);
L = kfoldLoss(cvrens)
L = 28.7114
Display Individual Losses for Each Cross-Validation Fold
The mean squared error (MSE) is a measure of model quality. Examine the MSE for each fold of a cross-validated regression model.
Load the carsmall data set. Specify the predictor X and the response data Y.
load carsmall
X = [Cylinders Displacement Horsepower Weight];
Y = MPG;
Train a cross-validated regression tree model. By default, the software implements 10-fold cross-validation.
rng('default') % For reproducibility
CVMdl = fitrtree(X,Y,'CrossVal','on');
Compute the MSE for each fold. Visualize the distribution of the loss values by using a box plot. Notice that none of the values is an outlier.
losses = kfoldLoss(CVMdl,'Mode','individual')
losses = 10×1
42.5072
20.3995
22.3737
34.4255
40.8005
60.2755
19.5562
9.2060
29.0788
16.3386
boxchart(losses)
Find Optimal Number of Trees for GAM Using kfoldLoss
Train a cross-validated generalized additive model (GAM) with 10 folds. Then, use kfoldLoss to compute the cumulative cross-validation regression loss (mean squared errors). Use the errors to determine the optimal number of trees per predictor (linear term for predictor) and the optimal number of trees per interaction term.
Alternatively, you can find optimal values of fitrgam name-value arguments by using the OptimizeHyperparameters name-value argument. For an example, see Optimize GAM Using OptimizeHyperparameters.
Load the patients data set.
load patients
Create a table that contains the predictor variables (Age, Diastolic, Smoker, Weight, Gender, and SelfAssessedHealthStatus) and the response variable (Systolic).
tbl = table(Age,Diastolic,Smoker,Weight,Gender,SelfAssessedHealthStatus,Systolic);
Create a cross-validated GAM by using the default cross-validation option. Specify the 'CrossVal' name-value argument as 'on'. Also, specify to include 5 interaction terms.
rng('default') % For reproducibility
CVMdl = fitrgam(tbl,'Systolic','CrossVal','on','Interactions',5);
If you specify 'Mode' as 'cumulative' for kfoldLoss, then the function returns cumulative errors, which are the average errors across all folds obtained using the same number of trees for each fold. Display the number of trees for each fold.
CVMdl.NumTrainedPerFold
ans = struct with fields:
PredictorTrees: [300 300 300 300 300 300 300 300 300 300]
InteractionTrees: [76 100 100 100 100 42 100 100 59 100]
kfoldLoss can compute cumulative errors using up to 300 predictor trees and 42 interaction trees.
Plot the cumulative, 10-fold cross-validated, mean squared errors. Specify 'IncludeInteractions' as false to exclude interaction terms from the computation.
L_noInteractions = kfoldLoss(CVMdl,'Mode','cumulative','IncludeInteractions',false);
figure
plot(0:min(CVMdl.NumTrainedPerFold.PredictorTrees),L_noInteractions)
The first element of L_noInteractions is the average error over all folds obtained using only the intercept (constant) term. The (J+1)th element of L_noInteractions is the average error obtained using the intercept term and the first J predictor trees per linear term. Plotting the cumulative loss allows you to monitor how the error changes as the number of predictor trees in the GAM increases.
Find the minimum error and the number of predictor trees used to achieve the minimum error.
[M,I] = min(L_noInteractions)
M = 28.0506
I = 6
The GAM achieves the minimum error when it includes 5 predictor trees.
Compute the cumulative mean squared error using both linear terms and interaction terms.
L = kfoldLoss(CVMdl,'Mode','cumulative');
figure
plot(0:min(CVMdl.NumTrainedPerFold.InteractionTrees),L)
The first element of L is the average error over all folds obtained using the intercept (constant) term and all predictor trees per linear term. The (J+1)th element of L is the average error obtained using the intercept term, all predictor trees per linear term, and the first J interaction trees per interaction term. The plot shows that the error increases when interaction terms are added.
If you are satisfied with the error when the number of predictor trees is 5, you can create a predictive model by training the univariate GAM again and specifying 'NumTreesPerPredictor',5 without cross-validation.
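The retraining step described above might look like the following sketch. It rebuilds the same table from the patients data set and trains a GAM without interaction terms (the fitrgam default) and without cross-validation. The variable name Mdl is chosen here for illustration only.

```matlab
% Sketch: retrain a univariate GAM with 5 trees per predictor.
% Mdl is an illustrative variable name, not part of the original example.
load patients
tbl = table(Age,Diastolic,Smoker,Weight,Gender,SelfAssessedHealthStatus,Systolic);

rng('default') % For reproducibility
Mdl = fitrgam(tbl,'Systolic','NumTreesPerPredictor',5);

% Predict the response for the first five rows with the retrained model
yfit = predict(Mdl,tbl(1:5,:));
```

Because no cross-validation argument is specified, fitrgam returns a RegressionGAM object that you can pass to predict directly.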
Input Arguments
CVMdl
— Cross-validated partitioned regression model
RegressionPartitionedModel object | RegressionPartitionedEnsemble object | RegressionPartitionedGAM object | RegressionPartitionedGP object | RegressionPartitionedNeuralNetwork object | RegressionPartitionedSVM object
Cross-validated partitioned regression model, specified as a RegressionPartitionedModel, RegressionPartitionedEnsemble, RegressionPartitionedGAM, RegressionPartitionedGP, RegressionPartitionedNeuralNetwork, or RegressionPartitionedSVM object. You can create the object in two ways:
Pass a trained regression model listed in the following table to its crossval object function.
Train a regression model using a function listed in the following table and specify one of the cross-validation name-value arguments for the function.
Regression Model | Function
RegressionEnsemble | fitrensemble
RegressionGAM | fitrgam
RegressionGP | fitrgp
RegressionNeuralNetwork | fitrnet
RegressionSVM | fitrsvm
RegressionTree | fitrtree
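The two ways of creating the object can be sketched as follows, using a regression tree on the carsmall data. The variable names Mdl, CVMdl1, and CVMdl2 are illustrative only, and 'KFold' is one of several cross-validation name-value arguments that fitrtree accepts.

```matlab
% Sketch: two ways to obtain a cross-validated regression tree.
load carsmall
X = [Displacement Horsepower Weight];

% Way 1: train a model, then pass it to its crossval object function
Mdl = fitrtree(X,MPG);
rng('default') % For reproducibility
CVMdl1 = crossval(Mdl);

% Way 2: specify a cross-validation name-value argument during training
rng('default') % For reproducibility
CVMdl2 = fitrtree(X,MPG,'KFold',10);

L1 = kfoldLoss(CVMdl1);
L2 = kfoldLoss(CVMdl2);
```

Both calls return a RegressionPartitionedModel object with 10 folds, so either object can be passed to kfoldLoss.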
Name-Value Arguments
Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.
Before R2021a, use commas to separate each name and value, and enclose Name in quotes.
Example: kfoldLoss(CVMdl,'Folds',[1 2 3 5]) specifies to use the first, second, third, and fifth folds to compute the mean squared error, but to exclude the fourth fold.
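As a brief illustration of the two syntaxes, the following sketch computes the same loss with the Name=Value form (R2021a or later) and with the quoted, comma-separated form. The model and the variable names La and Lb are assumptions made for the example.

```matlab
% Sketch: equivalent name-value syntaxes for the same call.
load carsmall
X = [Cylinders Displacement Horsepower Weight];
rng('default') % For reproducibility
CVMdl = fitrtree(X,MPG,'CrossVal','on'); % 10-fold by default

La = kfoldLoss(CVMdl,Folds=[1 2 3 5]);    % R2021a and later
Lb = kfoldLoss(CVMdl,'Folds',[1 2 3 5]);  % works in all releases

sameResult = isequal(La,Lb);
```

Both calls use only folds 1, 2, 3, and 5 of the same cross-validated model, so the two results are identical.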
Folds
— Fold indices to use
1:CVMdl.KFold (default) | positive integer vector
Fold indices to use, specified as a positive integer vector. The elements of Folds must be within the range from 1 to CVMdl.KFold.
The software uses only the folds specified in Folds.
Example: 'Folds',[1 4 10]
Data Types: single | double
IncludeInteractions
— Flag to include interaction terms
true | false
Flag to include interaction terms of the model, specified as true or false. This argument is valid only for a generalized additive model (GAM). That is, you can specify this argument only when CVMdl is a RegressionPartitionedGAM object.
The default value is true if the models in CVMdl (CVMdl.Trained) contain interaction terms. The value must be false if the models do not contain interaction terms.
Example: 'IncludeInteractions',false
Data Types: logical
LossFun
— Loss function
'mse' (default) | function handle
Loss function, specified as 'mse' or a function handle.
Specify the built-in function 'mse'. In this case, the loss function is the mean squared error.
Specify your own function using function handle notation.
    Assume that n is the number of observations in the training data (CVMdl.NumObservations). Your function must have the signature lossvalue = lossfun(Y,Yfit,W), where:
    The output argument lossvalue is a scalar.
    You specify the function name (lossfun).
    Y is an n-by-1 numeric vector of observed responses.
    Yfit is an n-by-1 numeric vector of predicted responses.
    W is an n-by-1 numeric vector of observation weights.
    Specify your function using 'LossFun',@lossfun.
Data Types: char | string | function_handle
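A custom loss function with the signature described above could be sketched as follows. The example uses a weighted mean absolute error, which is only one possible choice. The handle name maeLoss and the variable L_mae are hypothetical.

```matlab
% Sketch: weighted mean absolute error as a custom loss.
% maeLoss is a hypothetical name; any function with the
% signature lossvalue = lossfun(Y,Yfit,W) can be used.
load carsmall
X = [Displacement Horsepower Weight];
rng(10,'twister') % For reproducibility
CVMdl = crossval(fitrensemble(X,MPG));

% Y, Yfit, and W are column vectors; the result is a scalar
maeLoss = @(Y,Yfit,W) sum(W.*abs(Y - Yfit))/sum(W);

L_mae = kfoldLoss(CVMdl,'LossFun',maeLoss);
```

Dividing by sum(W) keeps the result a weighted average even if the weights passed to the function do not sum to 1.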
Mode
— Aggregation level for output
'average' (default) | 'individual' | 'cumulative'
Aggregation level for the output, specified as 'average', 'individual', or 'cumulative'.
Value | Description
'average' | The output is a scalar average over all folds.
'individual' | The output is a vector of length k containing one value per fold, where k is the number of folds.
'cumulative' | Note: If you want to specify this value, CVMdl must be a RegressionPartitionedEnsemble or RegressionPartitionedGAM object. For the form of the output in each case, see the description of L under Output Arguments.
Example: 'Mode','individual'
PredictionForMissingValue
— Predicted response value to use for observations with missing predictor values
"median" | "mean" | "omitted" | numeric scalar
Since R2023b
Predicted response value to use for observations with missing predictor values, specified as "median", "mean", "omitted", or a numeric scalar. This argument is valid only for a Gaussian process regression, neural network, or support vector machine model. That is, you can specify this argument only when CVMdl is a RegressionPartitionedGP, RegressionPartitionedNeuralNetwork, or RegressionPartitionedSVM object.
Value | Description
"median" | kfoldLoss uses the median of the observed response values in the training-fold data as the predicted response value for observations with missing predictor values. This value is the default when CVMdl is a RegressionPartitionedGP, RegressionPartitionedNeuralNetwork, or RegressionPartitionedSVM object.
"mean" | kfoldLoss uses the mean of the observed response values in the training-fold data as the predicted response value for observations with missing predictor values.
"omitted" | kfoldLoss excludes observations with missing predictor values from the loss computation.
Numeric scalar | kfoldLoss uses this value as the predicted response value for observations with missing predictor values.
If an observation is missing an observed response value or an observation weight, then kfoldLoss does not use the observation in the loss computation.
Example: "PredictionForMissingValue","omitted"
Data Types: single | double | char | string
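The following sketch shows how the argument might be passed for a cross-validated SVM regression model. It requires R2023b or later. The choice of fitrsvm with standardized predictors and the variable names are assumptions for illustration. The two losses can coincide when no validation-fold observation has missing predictor values.

```matlab
% Sketch (R2023b or later): compare the default behavior with "omitted".
load carsmall
X = [Displacement Horsepower Weight];
rng('default') % For reproducibility
CVMdl = fitrsvm(X,MPG,'Standardize',true,'CrossVal','on');

% Default: predict the training-fold median for observations
% with missing predictor values
L_median = kfoldLoss(CVMdl);

% Exclude observations with missing predictor values instead
L_omitted = kfoldLoss(CVMdl,'PredictionForMissingValue','omitted');
```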
Output Arguments
L
— Loss
numeric scalar | numeric column vector
Loss, returned as a numeric scalar or numeric column vector.
By default, the loss is the mean squared error between the validation-fold observations and the predictions made with a regression model trained on the training-fold observations.
If Mode is 'average', then L is the average loss over all folds.
If Mode is 'individual', then L is a k-by-1 numeric column vector containing the loss for each fold, where k is the number of folds.
If Mode is 'cumulative' and CVMdl is RegressionPartitionedEnsemble, then L is a min(CVMdl.NumTrainedPerFold)-by-1 numeric column vector. Each element j is the average loss over all folds that the function obtains using ensembles trained with weak learners 1:j.
If Mode is 'cumulative' and CVMdl is RegressionPartitionedGAM, then the output value depends on the IncludeInteractions value.
    If IncludeInteractions is false, then L is a (1 + min(NumTrainedPerFold.PredictorTrees))-by-1 numeric column vector. The first element of L is the average loss over all folds that is obtained using only the intercept (constant) term. The (j + 1)th element of L is the average loss obtained using the intercept term and the first j predictor trees per linear term.
    If IncludeInteractions is true, then L is a (1 + min(NumTrainedPerFold.InteractionTrees))-by-1 numeric column vector. The first element of L is the average loss over all folds that is obtained using the intercept (constant) term and all predictor trees per linear term. The (j + 1)th element of L is the average loss obtained using the intercept term, all predictor trees per linear term, and the first j interaction trees per interaction term.
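For the ensemble case, the cumulative output can be inspected as in the following sketch, which reuses the carsmall ensemble from the first example. The variable names Lcum, minLoss, and numLearners are illustrative.

```matlab
% Sketch: cumulative loss for a cross-validated regression ensemble.
load carsmall
X = [Displacement Horsepower Weight];
rng(10,'twister') % For reproducibility
CVMdl = crossval(fitrensemble(X,MPG));

% Element j is the average loss over all folds using weak learners 1:j
Lcum = kfoldLoss(CVMdl,'Mode','cumulative');

% Number of weak learners that gives the smallest cross-validated loss
[minLoss,numLearners] = min(Lcum);

figure
plot(Lcum)
xlabel('Number of weak learners')
ylabel('Cumulative 10-fold MSE')
```

The length of Lcum equals min(CVMdl.NumTrainedPerFold), so the plot covers only the learner counts that are available in every fold.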
Alternative Functionality
If you want to compute the cross-validated loss of a tree model, you can avoid constructing a RegressionPartitionedModel object by calling cvloss. Creating a cross-validated tree object can save you time if you plan to examine it more than once.
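A minimal sketch of the cvloss alternative, assuming a regression tree trained on the carsmall data (the variable names tree and E are illustrative):

```matlab
% Sketch: cross-validated loss of a tree without a partitioned object.
load carsmall
X = [Cylinders Displacement Horsepower Weight];
tree = fitrtree(X,MPG);

rng('default') % For reproducibility; cvloss partitions the data randomly
E = cvloss(tree); % cross-validated mean squared error
```

Each call to cvloss repeats the cross-validation, which is why keeping a cross-validated object is preferable when you need several different summaries of the same partition.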
Extended Capabilities
GPU Arrays
Accelerate code by running on a graphics processing unit (GPU) using Parallel Computing Toolbox™.
Usage notes and limitations:
This function fully supports GPU arrays for the following models.
RegressionPartitionedModel object fitted using fitrtree, or by passing a RegressionTree object to crossval
For more information, see Run MATLAB Functions on a GPU (Parallel Computing Toolbox).
Version History
Introduced in R2011a
R2024b: Specify GPU arrays for neural network models (requires Parallel Computing Toolbox)
kfoldLoss fully supports GPU arrays for RegressionPartitionedNeuralNetwork models.
R2023b: Specify predicted response value to use for observations with missing predictor values
Starting in R2023b, when you predict or compute the loss, some regression models allow you to specify the predicted response value for observations with missing predictor values. Specify the PredictionForMissingValue name-value argument to use a numeric scalar, the training set median, or the training set mean as the predicted value. When computing the loss, you can also specify to omit observations with missing predictor values.
This table lists the object functions that support the PredictionForMissingValue name-value argument. By default, the functions use the training set median as the predicted response value for observations with missing predictor values.
Model Type | Model Objects | Object Functions
Gaussian process regression (GPR) model | RegressionGP, CompactRegressionGP | loss, predict, resubLoss, resubPredict
Gaussian process regression (GPR) model | RegressionPartitionedGP | kfoldLoss, kfoldPredict
Gaussian kernel regression model | RegressionKernel | loss, predict
Gaussian kernel regression model | RegressionPartitionedKernel | kfoldLoss, kfoldPredict
Linear regression model | RegressionLinear | loss, predict
Linear regression model | RegressionPartitionedLinear | kfoldLoss, kfoldPredict
Neural network regression model | RegressionNeuralNetwork, CompactRegressionNeuralNetwork | loss, predict, resubLoss, resubPredict
Neural network regression model | RegressionPartitionedNeuralNetwork | kfoldLoss, kfoldPredict
Support vector machine (SVM) regression model | RegressionSVM, CompactRegressionSVM | loss, predict, resubLoss, resubPredict
Support vector machine (SVM) regression model | RegressionPartitionedSVM | kfoldLoss, kfoldPredict
In previous releases, the regression model loss and predict functions listed above used NaN predicted response values for observations with missing predictor values. The software omitted observations with missing predictor values from the resubstitution ("resub") and cross-validation ("kfold") computations for prediction and loss.
R2023a: GPU support for RegressionPartitionedSVM models
Starting in R2023a, kfoldLoss fully supports GPU arrays for RegressionPartitionedSVM models.