# The calculated R-squared is not equal to the square of the correlation coefficient from the MATLAB function corr

Yuzhen Lu on 23 Apr 2020
Edited: John D'Errico on 24 Apr 2020
With model predictions and true values, R^2 (the coefficient of determination) can be readily calculated using the standard formula:
Rsq = 1 - sum((ytrue - ypred).^2)/sum((ytrue - mean(ytrue)).^2)
Alternatively, R^2 can be obtained by squaring the correlation coefficient, computed with built-in functions such as corr or corrcoef:
Rsq = (corr(ytrue,ypred))^2
However, I find that the latter value is slightly larger than the former. Why does the built-in function give a higher value?
dpb on 24 Apr 2020
Although they're not the same calculation

Ameer Hamza on 23 Apr 2020
You are trying to find the coefficient of determination (R-squared), whereas corr(), as shown in its documentation: https://www.mathworks.com/help/releases/R2020a/stats/corr.html#d120e195813 calculates Pearson's linear correlation coefficient. I am not sure whether any MATLAB built-in function computes R-squared directly; however, I found this submission on FEX: https://www.mathworks.com/matlabcentral/fileexchange/34492-r-square-the-coefficient-of-determination. Internally, it implements the same formula you are using right now.
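For linear regression models, one possible route is fitlm, which reports R-squared as a property of the fitted model. The following is only a minimal sketch, assuming the Statistics and Machine Learning Toolbox is available; the data, the seed, and the variable names RsqBuiltin and RsqManual are made up for illustration.

```matlab
% Sketch: read R-squared from a fitted LinearModel and compare it with the manual formula.
rng(0);                              % arbitrary fixed seed, for repeatability only
x = rand(10,1);
y = rand(10,1);

mdl  = fitlm(x, y);                  % straight-line fit; a constant term is included by default
pred = predict(mdl, x);              % model predictions at the training points

RsqBuiltin = mdl.Rsquared.Ordinary;  % R-squared as reported by the model object
RsqManual  = 1 - sum((y - pred).^2)/sum((y - mean(y)).^2);

fprintf('fitlm: %.12f   manual: %.12f\n', RsqBuiltin, RsqManual);
```

This only covers models that fitlm itself fits; for predictions from an arbitrary model, the manual formula is still needed.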

John D'Errico on 24 Apr 2020
Edited: John D'Errico on 24 Apr 2020
What I do not see is the actual model you used. Did you use a linear model? Was there a constant term in the model? The problem is, depending on the model, the claims you make about R^2 and the correlation coefficient are only valid for specific models.
>> x = rand(10,1);
>> y = rand(10,1);
>> p2 = polyfit(x,y,2);
>> pred = polyval(p2,x);
>> Rsq = 1 - sum((y - pred).^2)/sum((y - mean(y)).^2)
Rsq =
0.140274350649466
>> corr(y,pred).^2
ans =
0.140274350649466
So, the square of the correlation coefficient is the same as the value your formula computes. It matches down to the last digit, which is my expectation.
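A short derivation shows why the match is exact here. It assumes a least-squares fit that is linear in its coefficients and includes a constant term, as polyfit does. The symbols are introduced for this sketch only: $\hat y_i$ are the predictions, $e_i = y_i - \hat y_i$ the residuals, and $r$ the correlation between $y$ and $\hat y$. For such a fit the residuals satisfy $\sum_i e_i = 0$ and $\sum_i e_i \hat y_i = 0$, so the mean of $\hat y$ equals $\bar y$.

```latex
\begin{align}
\sum_i (y_i-\bar y)(\hat y_i-\bar y)
  &= \sum_i (\hat y_i-\bar y+e_i)(\hat y_i-\bar y)
   = \sum_i (\hat y_i-\bar y)^2, \\
\sum_i (y_i-\bar y)^2
  &= \sum_i (\hat y_i-\bar y)^2 + \sum_i e_i^2, \\
r^2
  &= \frac{\left[\sum_i (y_i-\bar y)(\hat y_i-\bar y)\right]^2}
          {\sum_i (y_i-\bar y)^2 \,\sum_i (\hat y_i-\bar y)^2}
   = \frac{\sum_i (\hat y_i-\bar y)^2}{\sum_i (y_i-\bar y)^2}
   = 1 - \frac{\sum_i e_i^2}{\sum_i (y_i-\bar y)^2}.
\end{align}
```

Both steps rely on $\sum_i e_i = 0$, which is exactly what a constant term in the model guarantees.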
However, now try the same thing, but using a model that has no constant term in it. In this case, I'll use a cubic polynomial fit, but one that has no constant term. We can do that using backslash, though I could have done the fit using any number of tools.
>> mdl = [x,x.^2,x.^3]\y
mdl =
0.552026949387604
3.2235169295382
-3.50451900695301
>> pred = [x,x.^2,x.^3]*mdl;
>> Rsq = 1 - sum((y - pred).^2)/sum((y - mean(y)).^2)
Rsq =
0.195980323024559
>> corr(y,pred).^2
ans =
0.200698709640219
What went wrong? The error is in the assumption that the two approaches compute the same thing. They do not when the model has no estimated constant term.
There are adjusted R^2 computations that can be more accurate in these cases, but even so, once the model lacks a constant term there is no longer any expectation that the two formulas will give the same result.
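As a rough check of this point, the sketch below fits the same cubic with and without a column of ones and compares the two computations. The seed and the names A0, A1, rsq, Rsq0, and so on are illustrative choices, not part of the original question. It also hints at why the squared correlation came out larger: corr(y,pred)^2 equals the R^2 of the best straight-line recalibration a + b*pred of the predictions, so it can never fall below the value from the 1 - SSE/SST formula applied to pred itself.

```matlab
% Sketch: compare the two R^2 computations with and without a constant term.
rng(0);                               % arbitrary fixed seed, for repeatability only
x = rand(10,1);
y = rand(10,1);

% Formula from the question, as an anonymous function of truth and prediction.
rsq = @(y,p) 1 - sum((y - p).^2)/sum((y - mean(y)).^2);

A0 = [x, x.^2, x.^3];                 % cubic with no constant term
A1 = [ones(size(x)), A0];             % same cubic, constant term included

pred0 = A0*(A0\y);                    % least-squares predictions, no constant
pred1 = A1*(A1\y);                    % least-squares predictions, with constant

Rsq0 = rsq(y,pred0);   r2corr0 = corr(y,pred0)^2;
Rsq1 = rsq(y,pred1);   r2corr1 = corr(y,pred1)^2;

res0 = sum(y - pred0);                % residual sum, generally nonzero without a constant
res1 = sum(y - pred1);                % residual sum, zero up to rounding with a constant

fprintf('no constant:   Rsq = %.6f, corr^2 = %.6f, sum(resid) = %.3g\n', Rsq0, r2corr0, res0);
fprintf('with constant: Rsq = %.6f, corr^2 = %.6f, sum(resid) = %.3g\n', Rsq1, r2corr1, res1);
```

With the constant term included, the residuals sum to zero and the two numbers should agree to rounding error. Without it, the formula value should be the smaller of the two.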