How to Cluster Dataset and remove outlier in MATLAB
26 views (last 30 days)
Show older comments
Hello, I have the following dataset, In which i have four features in each column.
I want to cluster Dataset. I have go through K-means it required Number of clusters as input.
2 Comments
Answers (2)
Sai Pavan
on 17 Apr 2024
Hello,
I understand that you want to cluster the 4-feature dataset and remove the outliers from the dataset. This task can be carried out using the following workflow:
- Determine the optimal number of clusters: The elbow method involves plotting the within-cluster sum of squares (WCSS) against the number of clusters and looking for the "elbow" point where the rate of decrease sharply changes. This point is often considered a good choice for the number of clusters.
- Perform K-means clustering: After determining the optimal number of clusters, perform k-means clustering.
- Removing outliers: Outliers can be detected and removed based on their distance from the centroid of their assigned cluster. A common approach is to remove points that are farthest from the centroid beyond a certain threshold.
Please refer to the below code snippet that illustrates the above workflow:
data = Dataset;
wcss = [];
for k = 1:10 % Test up to 10 clusters
[idx, C, sumd] = kmeans(data, k, 'Replicates', 10);
wcss(k) = sum(sumd);
end
plot(1:10, wcss);
xlabel('Number of clusters');
ylabel('WCSS');
title('Elbow Method');
optimalK = % the optimal number of clusters you determined
[idx, C, sumd] = kmeans(data, optimalK, 'Replicates', 10);
% Calculate distances of each point to its cluster centroid
distances = zeros(size(data, 1), 1);
for i = 1:optimalK
clusterPoints = data(idx == i, :);
centroid = C(i, :);
distances(idx == i) = sqrt(sum((clusterPoints - centroid).^2, 2));
end
threshold = prctile(distances, 95); % Define a threshold for outlier removal, e.g., 95th percentile of distances
outliers = distances > threshold; % Identify outliers
% Remove outliers
dataCleaned = data(~outliers, :);
idxCleaned = idx(~outliers);
Hope it helps!
2 Comments
Walter Roberson
on 24 Apr 2024
Moved: Walter Roberson
on 24 Apr 2024
load Question
dataset1=data(:,[2 4]);
dataset1 is created from 2 columns of data
% Step 1: Identify and remove outliers
freq_outliers = isoutlier(dataset1(:, 1));
pw_outliers = isoutlier(dataset1(:, 2));
outliers = freq_outliers | pw_outliers;
% Step 2: Remove rows with outliers from all columns
dataset1_no_outliers = dataset1(~outliers, :);
pdw_no_outliers = data(~outliers, :);
% Now, continue with your existing code using 'dataset1_no_outliers'
eva = evalclusters(dataset1_no_outliers, 'kmeans', 'silhouette', 'KList', [1:8]);
%eva = evalclusters(dataset1, 'kmeans', 'silhouette', 'KList', [1:8]);
K = eva.OptimalK;
[idx,C,sumdist] = kmeans(dataset1,K);
C is created from dataset1 so it has two columns
dataset=data;
dataset_idx=zeros(length(dataset),5);
dataset_idx=dataset(:,1:5);
dataset_idx(:,6)=idx;
clusters = cell(K,1);
for i = 1:K
clusters{i} = dataset_idx(dataset_idx(:,6) == i,:);
end
cluster_assignments=idx;
optimalK=K
% Calculate distances of each point to its cluster centroid
distances = zeros(size(data, 1), 1);
for i = 1:optimalK
clusterPoints = data(idx == i, :);
data has 6 columns, so clusterPoints has 6 columns
centroid = C(i, :);
centroid is created from C so it has two columns
whos clusterPoints centroid
distances(idx == i) = sqrt(sum((clusterPoints - centroid).^2, 2));
You are trying to subtract something with 2 columns from something with 6 columns, which is an error
end
threshold = prctile(distances, 95); % Define a threshold for outlier removal, e.g., 95th percentile of distances
outliers = distances > threshold; % Identify outliers
% Remove outliers
dataCleaned = data(~outliers, :);
idxCleaned = idx(~outliers);
See Also
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!