Compute standard deviations of predictions of linear and polynomial regression models

27 views (last 30 days)
I'm using the fit and fitlm functions to fit various linear and polynomial regression models, and then using predict and predint to compute predictions of the response variable with lower/upper confidence intervals as in the example below. However, I also want to calculate standard deviations, y_sigma, of the predictions. Is there an easy way to do that?
% Some data
X = [239.38 254.46 266.06 269.20 277.59]';
Y = [194.72 201.03 206.94 209.58 212.32]';
% Make predictions at
x = linspace(230, 290, 13)';
Linear regression
% Fit LR model
model = fitlm(X, Y);
% Make prediction at new points
[y_mean, y_int] = predict(model, x, 'Alpha', 0.1);
Fit polynomial (e.g. cubic)
% Fit polynomial model
fit_type = "poly3";
[model, gof, output] = fit(X, Y, fit_type);
% Make prediction at new points
[y_int, y_mean] = predint(model, x, 0.9, 'Observation', 'off');
For comparison, a Gaussian process model can produce std. deviations y_sigma as follows:
% Fit GPR model
model = fitrgp(X, Y)
% Make prediction at new points
[y_mean, y_sigma, y_int] = predict(model, x, 'Alpha', 0.1);
I want to make sure that y_mean, y_sigma, y_int are comparable in all three cases above (i.e. compare "apples with apples").
  2 Comments
Bill Tubbs
Bill Tubbs on 12 Apr 2023
For context, the reason for wanting the standard deviations at each x, is to produce an overall measure of the 'total uncertainty' in the predictions of the model over the full range of x. In my application, I am simply defining 'total uncertainty' as follows
total_uncertainty = mean(y_sigma)
the cyclist
the cyclist on 12 Apr 2023
This is misguided. y_sigma is not a measure of uncertainty. See my comment on your answer.
In an OLS model the RMSE is a measure of average uncertainty in model prediction. (I guess you could multiply that by some arbitrary number of points to call it a "total" uncertainty, but that seems silly.)
I'd need to dig a little to see if there is an equivalent average uncertainty measure for the gaussian process model. But, I'm not sure it's ever truly an "apples to apples" comparison between a two-parameter model and a one-parameter model.

Sign in to comment.

Answers (2)

the cyclist
the cyclist on 11 Apr 2023
I am by no means an expert on gaussian process models, but I don't think that an ordinary least squares regression (fitlm) has the equivalent parameter to y_sigma here. I don't mean that it isn't reported -- I mean it does not exist. If one thinks about the underlying assumptions of the generative process of the data, there is no y_sigma equivalent.
I won't be surprised if someone here answers by calculating the standard deviation of something coming out of the fitlm. But I'd be wary that it is the statistical equivalent of what you want.
  2 Comments
Bill Tubbs
Bill Tubbs on 11 Apr 2023
Edited: Bill Tubbs on 11 Apr 2023
Yeah I think you might be right. But then, since there is a confidence interval, and if it is based on a Gaussian distribution, then surely there must be a std. deviation as well? But I also don't have the statistical background to know if that is true.
the cyclist
the cyclist on 12 Apr 2023
In OLS the parameter has error, and that error has a gaussian distribution.
That is different from saying you have two parameters (mean and standard deviation of a gaussian) to fit.
Gaussians abound in parameterized statistics.

Sign in to comment.


Bill Tubbs
Bill Tubbs on 12 Apr 2023
Edited: Bill Tubbs on 12 Apr 2023
Looking at the linear regression example, the mean predictions are in fact exactly half-way between the confidence intervals,
assert(all(abs(mean(y_int, 2) - y_mean) < 1e-12))
plot(X, Y, 'b+'); hold on
plot(x, y_int(:, 1), 'r--', x, y_int(:, 2), 'r--', x, y_mean, 'b-');
grid on; xlabel('x')
legend(["data", "LCB", "UCB", "mean"], 'Location', 'best')
Plot of linear model predictions and confidence intervals
If that is the case, can I simply calculate the std. deviations of the prediction errors as follows?
sd = norminv(0.5 + (1 - 0.1) / 2);
y_sigma = diff(y_int, [], 2) ./ (2 * sd);
plot(x, y_sigma); grid on;
xlabel('x'); ylabel('sigma'); ylim([0 inf])
Could someone with a good grounding in statistics tell me if this is a valid calculation of y_sigma?
  4 Comments
Jeff Miller
Jeff Miller on 13 Apr 2023
  1. "I thought the lower and upper confidence bounds produced during the fitting of the linear model (y_int above) reflected the uncertainty of the model predictions at the new points (x)." The bounds that you get from predict are bounds on where the curve might be--that is, bounds on what the true mean Y might be at each X. ((These are the red dashed lines in one of the plots you posted.) If you want bounds on the actual observed individual values of Y (i.e., bounds on observed Y data points), then you have to call predict with the optional parameter pair 'prediction','observation'. The bounds will be wider in the latter case, because there is more uncertainty about the location of the observations than about the location of their mean. I'm not entirely sure which kind of uncertainty you want to measure.
  2. "can I simply calculate the std. deviations of the prediction errors as follows?" No. The confidence intervals are computed from t distributions rather than normals (e.g., CIs for observations). You could use the same approach using the t distribution rather than the normal (model has the required dferror).
  3. Whichever type of prediction error you settle on for the linear and polynomial regression models, perhaps you can get estimate the comparable type of prediction error for the gaussian process model using a bootstrapping approach. (Disclaimer: I know nothing about these models.) The basic idea would be to repeatedly (a) take a bootstrap sample of your data, (b) fit the gpm to that sample, ( c) compute a predicted Y' from that gpm (whatever that means in terms of a gpm). Across iterations of this process, you can certainly look at the bootstrap standard error of those Y' values. I have no idea whether that would provide an "apples vs apples" comparison with either of the linear model variability measures, but maybe @the cyclist can shed some light on that. In any case, from his earlier replies I gather that y_sigma does not index prediction uncertainty in either of the senses that apply with the linear/poly models.
the cyclist
the cyclist on 13 Apr 2023
@Bill Tubbs, your understanding of the uncertainty coming out of fitlm is largely correct. y_int is indeed a measure of uncertainty in the prediction.
y_sigma, coming out of the fitrgp prediction, is not a measure of uncertainty. It is a fitted parameter.
Note that fitrgp also outputs a value for y_int, which is "prediction intervals of the response variable".
Here is my take, for "apples to apples". If your two models are
% LR model
model = fitlm(X, Y);
[y_mean, y_int] = predict(model, x, 'Alpha', 0.1)
% GPR model
model = fitrgp(X, Y);
[y_mean, y_sigma, y_int] = predict(model, x, 'Alpha', 0.1)
then
  • In both models, y_mean is the predicted response.
  • In both models, y_int is a measure of the uncertainty in that prediction.
  • In the gaussian process model, y_sigma is the predicted standard deviation of the response (which is NOT an uncertainty measure). There is NO equivalent of y_sigma in the linear model.

Sign in to comment.

Products


Release

R2021b

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!