Question about k-means clustering
How can we cluster a data set with k-means (k = 2) using all of its columns? The data set is here: https://archive.ics.uci.edu/ml/machine-learning-databases/hepatitis/
7 Comments
KALYAN ACHARJYA
on 3 Jan 2021
Edited: KALYAN ACHARJYA
on 3 Jan 2021
What is the problem? Issue with dataset or k-means?
Note: if you want help, then you need to make it easy to be helped.
Eeengineer
on 3 Jan 2021
Image Analyst
on 3 Jan 2021
I saved "hepatitis.data" from that web site, and the code didn't work:
load('hepatitis.data')
X=hepatitis(:,16:17);
figure;
plot(X,'k*');
title 'Hepatitis Data';
hold on;
opts = statset('Display','final');
[idx,C] = kmeans(X,2,'Distance','sqeuclidean',...
'Replicates',5,'Options',opts);
Please post the actual data file and code that actually works with it.
Image Analyst
on 3 Jan 2021
Doesn't run. load doesn't work. You're not making it easy for us, are you? I'll try to fix it. In the meantime, edit your post and format your code as code by highlighting it and clicking the code icon.
Image Analyst
on 3 Jan 2021
Come on Eeengineer. Please don't waste my time when I try to help you. I used xlsread() instead of load() and that got the data in, but there is no 17th column. Please fix or post your actual code. I'm going to do other stuff now and I'll check back later.
clear all;
close all;
clc;
format long g;
format compact;
fontSize = 15;
fprintf('Beginning to run %s.m ...\n', mfilename);
hepatitis = xlsread('hepatitis.xlsx')
X = hepatitis(:,16:17)
plot(X,'k*');
title 'Hepatitis Data';
hold on;
idx=kmeans(X,2);
opts = statset('Display','final');
[idx,C] = kmeans(X,2,'Distance','sqeuclidean',...
'Replicates',5,'Options',opts);
figure;
plot(X(idx==1,1),X(idx==1,2),'r.','MarkerSize',12)
hold on
plot(X(idx==2,1),X(idx==2,2),'b.','MarkerSize',12)
plot(C(:,1),C(:,2),'kx',...
'MarkerSize',15,'LineWidth',3)
legend('Cluster 1','Cluster 2','Centroids',...
'Location','NW')
title 'Cluster Assignments and Centroids'
hold off
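For anyone who hits the same load() failure: a minimal sketch of one way to read the raw text file directly, without converting it to a spreadsheet first. It assumes the file is comma-separated with no header row and uses '?' for missing values. To keep the sketch runnable without the download, it writes a few made-up rows in that assumed layout to a temporary file; sampleFile and its contents are hypothetical, not the real data.

```matlab
% Write a tiny hypothetical file in the assumed layout:
% comma-separated, no header row, '?' marking a missing value.
sampleFile = [tempname '.data'];
fid = fopen(sampleFile, 'w');
fprintf(fid, '2,30,2,1,2,2,0.9\n');
fprintf(fid, '1,50,1,?,1,2,4.2\n');
fprintf(fid, '2,78,1,2,1,?,0.7\n');
fclose(fid);

% readmatrix (R2019a or later) returns a numeric matrix;
% entries given as '?' come back as NaN.
data = readmatrix(sampleFile, 'FileType', 'text', ...
    'Delimiter', ',', 'NumHeaderLines', 0, 'TreatAsMissing', '?');

disp(size(data));            % rows and columns actually read
disp(sum(isnan(data), 1));   % missing entries per column

delete(sampleFile);
```

With the real file, replacing sampleFile by 'hepatitis.data' should be enough if the layout assumption holds. Checking size(data) before indexing avoids asking for a column that isn't there.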
Image Analyst
on 3 Jan 2021
Only columns 2 and 15 look like there is any real data in them. The rest of the columns just have 1, 2, or nan in them. Which columns do you want to take as "observations"? Are all of them observations, or just the columns 2 and 15?
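A quick way to check this kind of thing, rather than eyeballing the matrix, is to count the distinct non-missing values in each column. This is only a sketch; data here is a small hypothetical stand-in for the imported matrix, not the hepatitis file itself.

```matlab
% Hypothetical stand-in for the imported matrix (NaN = missing).
data = [2 30 1   0.9;
        1 50 2   4.2;
        2 78 NaN 0.7;
        2 31 1   NaN];

nCols = size(data, 2);
nDistinct = zeros(1, nCols);
for k = 1:nCols
    col = data(:, k);
    % Count distinct values, ignoring missing entries.
    nDistinct(k) = numel(unique(col(~isnan(col))));
end
nMissing = sum(isnan(data), 1);

% Columns with only two distinct values behave like binary flags;
% columns with many distinct values are the continuous measurements.
disp(table((1:nCols)', nDistinct', nMissing', ...
    'VariableNames', {'Column', 'Distinct', 'Missing'}));
```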
If I scatter columns 1 and 2 and 15, I see this:
hepatitis = xlsread('hepatitis.xlsx')
x = hepatitis(:,1);
y = hepatitis(:, 2);
z = hepatitis(:, 15);
scatter3(x, y, z, 'Filled');
title('Hepatitis Data', 'FontSize', 20);
xlabel('Column 1', 'FontSize', 20);
ylabel('Column 2', 'FontSize', 20);
zlabel('Column 15', 'FontSize', 20);

So where are the clusters? If you're going to include columns 1 and 3-14, and 16 in the observations, then the clusters might be dominated by what's in those columns since they're very discrete - either 1 or 2. Looking at just columns 2 and 15, it doesn't look like there are any meaningful clusters.
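If you do want to use every column, here is a rough sketch of one common approach: drop rows with missing values, standardize each column so that no column dominates the distance just because of its units, and then run kmeans with k = 2. The matrix below is randomly generated, hypothetical data with a similar mix of continuous and 1/2-coded columns, so the resulting clusters mean nothing. Whether z-scoring is the right treatment for the 1/2 columns is a judgment call, not something established in this thread.

```matlab
rng(0);   % fixed seed so the sketch is repeatable
n = 60;

% Hypothetical stand-in for the imported matrix: two continuous columns
% followed by three columns coded 1/2, with a few missing entries.
data = [randi([20 70], n, 1), 0.5 + 4*rand(n, 1), randi([1 2], n, 3)];
data([5 17], 2) = NaN;
data(30, 4) = NaN;

% Keep only complete rows, remembering which rows were kept.
keep = ~any(isnan(data), 2);
X = data(keep, :);

% Standardize each column to zero mean and unit standard deviation.
Xz = zscore(X);

% k = 2, with several restarts to reduce sensitivity to the random start.
[idxKept, C] = kmeans(Xz, 2, 'Distance', 'sqeuclidean', 'Replicates', 5);

% Map the labels back onto the original rows; dropped rows stay NaN.
idx = nan(size(data, 1), 1);
idx(keep) = idxKept;

fprintf('Cluster sizes: %d and %d (%d rows skipped)\n', ...
    sum(idx == 1), sum(idx == 2), sum(~keep));
```

The centroids in C are in standardized units, so they need to be converted back before being compared with the raw column values. kmeans will always return two groups, whether or not the data has any real cluster structure.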
Eeengineer
on 3 Jan 2021