Compute standard deviations of predictions of linear and polynomial regression models

Question

Bill Tubbs on 11 Apr 2023

0
Link

Direct link to this question

https://nl.mathworks.com/matlabcentral/answers/1945524-compute-standard-deviations-of-predictions-of-linear-and-polynomial-regression-models

Commented: the cyclist on 13 Apr 2023

I'm using the fit and fitlm functions to fit various linear and polynomial regression models, and then using predict and predint to compute predictions of the response variable with lower/upper confidence intervals as in the example below. However, I also want to calculate standard deviations, y_sigma, of the predictions. Is there an easy way to do that?

% Some data
X = [239.38 254.46 266.06 269.20 277.59]';
Y = [194.72 201.03 206.94 209.58 212.32]';
% Make predictions at
x = linspace(230, 290, 13)';

Linear regression

% Fit LR model
model = fitlm(X, Y);
% Make prediction at new points
[y_mean, y_int] = predict(model, x, 'Alpha', 0.1);

Fit polynomial (e.g. cubic)

% Fit polynomial model
fit_type = "poly3";
[model, gof, output] = fit(X, Y, fit_type);
% Make prediction at new points
[y_int, y_mean] = predint(model, x, 0.9, 'Observation', 'off');

For comparison, a Gaussian process model can produce std. deviations y_sigma as follows:

% Fit GPR model
model = fitrgp(X, Y)
% Make prediction at new points
[y_mean, y_sigma, y_int] = predict(model, x, 'Alpha', 0.1);

I want to make sure that y_mean, y_sigma, y_int are comparable in all three cases above (i.e. compare "apples with apples").

2 Comments
Show NoneHide None

Bill Tubbs on 12 Apr 2023

Open in MATLAB Online

For context, the reason for wanting the standard deviations at each x, is to produce an overall measure of the 'total uncertainty' in the predictions of the model over the full range of x. In my application, I am simply defining 'total uncertainty' as follows

total_uncertainty = mean(y_sigma)

the cyclist on 12 Apr 2023

This is misguided. y_sigma is not a measure of uncertainty. See my comment on your answer.

In an OLS model the RMSE is a measure of average uncertainty in model prediction. (I guess you could multiply that by some arbitrary number of points to call it a "total" uncertainty, but that seems silly.)

I'd need to dig a little to see if there is an equivalent average uncertainty measure for the gaussian process model. But, I'm not sure it's ever truly an "apples to apples" comparison between a two-parameter model and a one-parameter model.

Sign in to comment.

Sign in to answer this question.

Answer 1

the cyclist on 11 Apr 2023

0
Link

Direct link to this answer

https://nl.mathworks.com/matlabcentral/answers/1945524-compute-standard-deviations-of-predictions-of-linear-and-polynomial-regression-models#answer_1213959

I am by no means an expert on gaussian process models, but I don't think that an ordinary least squares regression (fitlm) has the equivalent parameter to y_sigma here. I don't mean that it isn't reported -- I mean it does not exist. If one thinks about the underlying assumptions of the generative process of the data, there is no y_sigma equivalent.

I won't be surprised if someone here answers by calculating the standard deviation of something coming out of the fitlm. But I'd be wary that it is the statistical equivalent of what you want.

2 Comments
Show NoneHide None

Bill Tubbs on 11 Apr 2023

Edited: Bill Tubbs on 11 Apr 2023

Yeah I think you might be right. But then, since there is a confidence interval, and if it is based on a Gaussian distribution, then surely there must be a std. deviation as well? But I also don't have the statistical background to know if that is true.

the cyclist on 12 Apr 2023

In OLS the parameter has error, and that error has a gaussian distribution.

That is different from saying you have two parameters (mean and standard deviation of a gaussian) to fit.

Gaussians abound in parameterized statistics.

Sign in to comment.

Answer 2

Bill Tubbs on 12 Apr 2023

0
Link

Direct link to this answer

https://nl.mathworks.com/matlabcentral/answers/1945524-compute-standard-deviations-of-predictions-of-linear-and-polynomial-regression-models#answer_1214739

Edited: Bill Tubbs on 12 Apr 2023

Open in MATLAB Online

Looking at the linear regression example, the mean predictions are in fact exactly half-way between the confidence intervals,

assert(all(abs(mean(y_int, 2) - y_mean) < 1e-12))
plot(X, Y, 'b+'); hold on
plot(x, y_int(:, 1), 'r--', x, y_int(:, 2), 'r--', x, y_mean, 'b-'); 
grid on; xlabel('x')
legend(["data", "LCB", "UCB", "mean"], 'Location', 'best')

Plot of linear model predictions and confidence intervals

If that is the case, can I simply calculate the std. deviations of the prediction errors as follows?

sd = norminv(0.5 + (1 - 0.1) / 2);
y_sigma = diff(y_int, [], 2) ./ (2 * sd);
plot(x, y_sigma); grid on; 
xlabel('x'); ylabel('sigma'); ylim([0 inf])

Could someone with a good grounding in statistics tell me if this is a valid calculation of y_sigma?

4 Comments
Show 2 older commentsHide 2 older comments

the cyclist on 12 Apr 2023

Open in MATLAB Online

I think you are misunderstanding a fundamental concept here, particularly about the gaussion process model.

After you estimate an OLS model of the form

Y = alpha + beta*X

then for any value x, you predict of a single value of y. This is the prediction of y, given x. (In your original question, you do this for several values of x, and call the result y_mean. I am trying to stick close to your notation.) There is uncertainty in the prediction of y_mean.

After you fit the gaussian process model, for each value of x, you do not predict a single value of y. Rather, you predict a gaussian for that x location. You predict N(y_mean,y_sigma). In effect, you have made two predictions: A prediction of y_mean, and a prediction of y_sigma. There is uncertainty in both of those predictions. y_sigma is NOT a measure of the uncertainty in y_mean. It is its own parameter.

As I emphasized in my own answer, there is no equivalent of y_sigma in the OLS model. (There is no predicted gaussian.)

Jeff Miller on 13 Apr 2023

@Bill Tubbs:

"I thought the lower and upper confidence bounds produced during the fitting of the linear model (y_int above) reflected the uncertainty of the model predictions at the new points (x)." The bounds that you get from predict are bounds on where the curve might be--that is, bounds on what the true mean Y might be at each X. ((These are the red dashed lines in one of the plots you posted.) If you want bounds on the actual observed individual values of Y (i.e., bounds on observed Y data points), then you have to call predict with the optional parameter pair 'prediction','observation'. The bounds will be wider in the latter case, because there is more uncertainty about the location of the observations than about the location of their mean. I'm not entirely sure which kind of uncertainty you want to measure.
"can I simply calculate the std. deviations of the prediction errors as follows?" No. The confidence intervals are computed from t distributions rather than normals (e.g., CIs for observations). You could use the same approach using the t distribution rather than the normal (model has the required dferror).
Whichever type of prediction error you settle on for the linear and polynomial regression models, perhaps you can get estimate the comparable type of prediction error for the gaussian process model using a bootstrapping approach. (Disclaimer: I know nothing about these models.) The basic idea would be to repeatedly (a) take a bootstrap sample of your data, (b) fit the gpm to that sample, ( c) compute a predicted Y' from that gpm (whatever that means in terms of a gpm). Across iterations of this process, you can certainly look at the bootstrap standard error of those Y' values. I have no idea whether that would provide an "apples vs apples" comparison with either of the linear model variability measures, but maybe @the cyclist can shed some light on that. In any case, from his earlier replies I gather that y_sigma does not index prediction uncertainty in either of the senses that apply with the linear/poly models.

the cyclist on 13 Apr 2023

Open in MATLAB Online

@Bill Tubbs, your understanding of the uncertainty coming out of fitlm is largely correct. y_int is indeed a measure of uncertainty in the prediction.

y_sigma, coming out of the fitrgp prediction, is not a measure of uncertainty. It is a fitted parameter.

Note that fitrgp also outputs a value for y_int, which is "prediction intervals of the response variable".

Here is my take, for "apples to apples". If your two models are

% LR model
model = fitlm(X, Y);
[y_mean, y_int] = predict(model, x, 'Alpha', 0.1)
% GPR model
model = fitrgp(X, Y);
[y_mean, y_sigma, y_int] = predict(model, x, 'Alpha', 0.1)

then

In both models, y_mean is the predicted response.
In both models, y_int is a measure of the uncertainty in that prediction.
In the gaussian process model, y_sigma is the predicted standard deviation of the response (which is NOT an uncertainty measure). There is NO equivalent of y_sigma in the linear model.

Sign in to comment.

Compute standard deviations of predictions of linear and polynomial regression models

2 Comments
Show NoneHide None

Answers (2)

2 Comments
Show NoneHide None

4 Comments
Show 2 older commentsHide 2 older comments

See Also

Categories

Tags

Products

Release

Community Treasure Hunt

Compute standard deviations of predictions of linear and polynomial regression models

2 Comments Show NoneHide None

Answers (2)

2 Comments Show NoneHide None

4 Comments Show 2 older commentsHide 2 older comments

See Also

Categories

Tags

Products

Release

Community Treasure Hunt

2 Comments
Show NoneHide None

2 Comments
Show NoneHide None

4 Comments
Show 2 older commentsHide 2 older comments