Using the Pearson correlation coefficient as the 'PDist' metric for the clustergram function

I am doing a hierarchical cluster analysis. I constructed the dendrograms with 'clustergram' using agglomerative average-linkage clustering. However, I need to use the Pearson correlation coefficient as the distance metric (the default distance metric is Euclidean, and other metrics are available as well), and I could not find Pearson correlation among the options for the 'PDist' function. Thanks for helping.

 Accepted Answer

The 'correlation' option for pdist uses the Pearson correlation (documentation page here). So to use this metric to calculate the distance between columns in clustergram, you can call:
clustergram(..., "ColumnPDist", "correlation")
The same option is available for "RowPDist" as well.
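Putting the question and the answer together, a minimal sketch of the full call might look like the following. The data matrix is made up for illustration, and the example assumes the Bioinformatics Toolbox (which provides clustergram) is installed. "Linkage" is set explicitly to "average" to match the average-linkage clustering described in the question.

```matlab
% Hypothetical data: 20 observations (rows) by 6 variables (columns).
rng(0);                  % fixed seed so the example is repeatable
data = rand(20, 6);

% Average-linkage clustering with Pearson correlation distance
% (one minus the correlation) on both rows and columns.
cg = clustergram(data, ...
    "Linkage", "average", ...
    "RowPDist", "correlation", ...
    "ColumnPDist", "correlation");
```

Setting only one of "RowPDist" or "ColumnPDist" leaves the other dimension on its default Euclidean distance.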

5 Comments

Hi Ronquist, thanks for helping me with this problem. I am a bit confused, though. On the documentation page, the formula for the correlation distance looks different from the formula given for the Pearson correlation in the corr function documentation. I am trying to understand it, could you please help me? Thanks again.
The first formula is the one I found in the clustergram function documentation (image not shown).
The second formula is the one I found on the Pearson correlation page (image not shown):
(https://www.mathworks.com/help/stats/corr.html#mw_1b19e0d5-7906-4577-a0a5-b20311da7faf)
Good question, let me clarify. The equations are written in different notations, but are very similar. Essentially, correlation distance ($d$) is one minus correlation ($\rho$):
$d = 1 - \rho$
This is because pdist outputs distances, and distances must be non-negative. Since correlation is bounded from -1 to 1, $1 - \rho$ will ensure that the correlation distance is always greater than or equal to zero.
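To line the two notations up, here is the relationship written out for two observation vectors $x_s$ and $x_t$ of length $n$. The index names $s$, $t$, and $j$ are chosen here for illustration; the pdist documentation writes the same sums as vector products, which is why the two pages look different at first glance.

```latex
\rho_{st} = \frac{\sum_{j=1}^{n} (x_{sj} - \bar{x}_s)(x_{tj} - \bar{x}_t)}
                 {\sqrt{\sum_{j=1}^{n} (x_{sj} - \bar{x}_s)^2}\,
                  \sqrt{\sum_{j=1}^{n} (x_{tj} - \bar{x}_t)^2}},
\qquad
d_{st} = 1 - \rho_{st},
\qquad
-1 \le \rho_{st} \le 1 \;\Longrightarrow\; 0 \le d_{st} \le 2 .
```

So a distance of 0 corresponds to perfectly correlated vectors, 1 to uncorrelated vectors, and 2 to perfectly anti-correlated vectors.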
You can double-check this by running pdist(X', "correlation") and 1-corr(X) on some matrix X. The pdist result should match the off-diagonal entries of the matrix returned by 1-corr(X):
X = rand(5,2);
pdist(X', "correlation")
1-corr(X)
I ran the code above and got the same answers. I am sorry, I am new to MATLAB. May I ask why we use X' instead of X when calculating pdist? In any case, I now understand that the correlation distance (d) in pdist is based on the Pearson correlation. Thank you so much.
The apostrophe is used to transpose the matrix X.
corr calculates the correlation between the columns of the input matrix. pdist calculates the distance between the rows of the input matrix. The apostrophe operator computes the complex conjugate transpose of X. When the values of X are all real numbers (as is the case here), this is the same as the basic transpose function.
X' can be replaced with transpose(X) in the code from above:
X = rand(5,2);
pdist(transpose(X), "correlation")
1-corr(X)
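With more than two columns, pdist returns the pairwise distances as a row vector, while 1-corr(X) returns a square matrix, so the two outputs are harder to compare by eye. One way to compare them directly, sketched here with a made-up 5-by-3 matrix, is to expand the pdist output with squareform:

```matlab
rng(1);                                      % fixed seed so the example is repeatable
X = rand(5, 3);                              % hypothetical data: 5 observations, 3 variables

% Distances between the columns of X, expanded into a 3-by-3 symmetric matrix.
D = squareform(pdist(X', "correlation"));

% One minus the Pearson correlation between the columns of X.
R = 1 - corr(X);

% Largest element-wise difference; expected to be at round-off level.
maxDiff = max(abs(D(:) - R(:)));
disp(maxDiff)
```

The diagonal agrees as well, since each column has correlation 1 with itself and therefore distance 0.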
The penny dropped. Thank you so much, Ronquist, for your help, patience, and hard work.

Release

R2020b
